Preface 


Origins 

In 1957 a grant was made to the National Bureau of Standards, by 
the National Science Foundation, for the support of a Training Program 
in Numerical Analysis for Senior University Staff, under my direction. 
An objective of this program was to attract mature mathematicians 
into an area of vital importance which had been largely neglected. 
The first chapter of this book tries to show that numerical analysis is an 
attractive subject in which mathematics of practically all sorts can be used 
significantly, and from which many branches of mathematics can benefit. 

After this was concluded it was decided to follow a suggestion of Dr. 
Olga Taussky and to develop the lectures given there into a book 
entitled “Survey of Numerical Analysis.” Unfortunately, for various 
reasons not all the speakers who took part in the program participated 
in the development of the book, and there are some gaps.* In order 
not to affect the unity of the program, it was decided not to attempt 
to fill these gaps by including new contributions.† However, ample 
material is included for an introductory course, as well as representative 
chapters for advanced courses in numerical analysis and in supporting 
mathematics. 

The authors are grateful to both organizations for the opportunity 
to present their ideas orally, and to their teachers, colleagues, and 
pupils for help in the later development. 


Activities of Numerical Analysts 


It is appropriate to discuss briefly what the activities of a numerical 
analyst should be. In addition to considering the exploitation of 


* Several of the gaps have been covered by excellent monographs which have 
appeared recently. They cover, for example, such subjects as asymptotics, com- 
putability and unsolvability, initial-value problems, and linear programming. 

† We note that Dr. Walter Gautschi and Dr. Werner C. Rheinboldt, who took 
part in the repetition of the Training Program (which took place in 1959, under the 
direction of Dr. Philip J. Davis), collaborated with Prof. H. A. Antosiewicz on 
Chapters 9 and 14. 
automatic computers in new areas, he should be concerned with the 
solution of classes of problems: e.g., the solution of systems of linear 
equations, or the solution of ordinary differential equations. As well 
as reexamining old methods in the light of available equipment, he 
should be devising and evaluating new methods. Since, in general, it 
will be impossible for him to give the methods a complete theoretical 
examination, he should carry out controlled computational experiments, 
in which, for instance, he compares the observed errors with his 
theoretical estimates for realism. These experiments should be recorded 
and analyzed. Finally, he should construct and discuss “bad examples.” 

Such material, when combined with the experience of computers 
and the intuition of the customer, will be invaluable when the methods 
are being applied in practice, beyond the regions in which they are 
secure in the sense of classical mathematics. 


The Education of Numerical Analysts 


Informal teaching of the use of computers and of numerical analysis 
can begin at a very early stage. Formal teaching is appropriate 
whenever a reasonable background in the calculus and matrix theory 
is achieved—usually in the junior year. The contents of Chapter 3 
and the first part of Chapter 8 are appropriate in a basic science 
curriculum. However, in view of the current tendency to abstraction, 
it may be necessary to incorporate them in the basic numerical analysis 
course. This course should include, in addition, most of the contents 
of Chapters 1, 2, 4, 5, and 6. We have covered this material in a 
two-quarter course, with three lecture hours per week and appropriate 
machine time. 

We believe that there should be no division between theoretical and 
practical numerical analysis, and that a lecture without numerical 
examples is a lecture wasted. The instructor should have had recent 
machine experience and the supervision of practical work should, as 
far as possible, not be delegated. The following general advice was 
given by Prof. G. Pólya* to prospective high school teachers: “Acquire, 
and keep up, some aptitude for problem solving.” This is particularly 
relevant here, and to it we would add the further qualification of 
experience in making examples. 

Our worked examples and problems have an academic flavor, but 
this is mainly for brevity. They can be dressed up by the instructor 
according to his taste; for instance, he can relate the calculations of 
the zeros of Bessel functions to the eigenvalues of a differential equation 
and to the frequencies of vibrations of a drumhead. It is not possible 


* G. Pólya, On the Curriculum for Prospective High School Teachers, Amer. Math. 
Monthly, vol. 65, pp. 101-104, 1958. 
to include in a survey significant case studies in, for example, reactor 
engineering, astrophysics, or geophysics. Fortunately, however, mono- 
graphs on such topics are becoming available. 

Only in exceptional circumstances will teaching institutions be able 
to provide computers and computer organizations at the level of the 
best of the governmental and commercial installations. Generally, 
therefore, we recommend that students get experience in such centers 
as soon as they have completed the basic course. After this they will 
be in a better position to appreciate advanced courses. Since the 
practicing numerical analyst meets problems from many different areas, 
one-quarter courses, such as could be based on the material in the later 
chapters, are appropriate rather than more extensive treatments of 
special topics. 

Finally, in view of the rapid developments in the field, students must 
be encouraged from the beginning to get acquainted with the periodical 
literature; for this purpose we have given ample references in the text 
and in the problems. The need for critical reading should be empha- 
sized. 


Remarks 


In a composite work of this character, complete uniformity and 
freedom from overlap is almost impossible to maintain. The known 
inconsistencies in notations and terminology should not disturb the 
reader, and the repetitions are to his advantage. We hope that the 
errors and inaccuracies which have been overlooked will not be 
troublesome. 

In the last decade, the electronic engineers have increased the power 
of our computers about a thousandfold; unfortunately there has been 
no comparable development in the relevant mathematics. We hope 
that this “Survey” will aid such a development; our views on this point 
are elaborated in Chapter 1. Although it may well be that the greatest 
contribution of automatic computers will be outside of the physical 
sciences, there is no doubt that a thorough grounding in mathematics 
and numerical analysis is the best initial training for those concerned 
with the use of computers if they are to avoid the many logical and 
arithmetic perils which await those who use their machines formally 
and uncritically. 
Motivation for Working in Numerical Analysis* 


JOHN TODD 


PROFESSOR OF MATHEMATICS 
CALIFORNIA INSTITUTE OF TECHNOLOGY 


1.1 Introduction 


The profession of numerical analysis is not yet so desirable that it is 
taken up by choice; indeed, although it is one of the oldest professions, it 
is only now becoming respectable. Most of those who are now working 
in this field have been more or less drafted into it, either in World War I 
or in World War II, or more recently. The question at issue is, Why 
have they stayed in this field and not returned to their earlier interests? 

The answer is that numerical analysis is an attractive subject in which 
mathematics of practically all sorts can be used significantly and from 
which many branches of mathematics can benefit. We call attention 
here to the applications of functional analysis by the Russian school led 
by Kantorovitch [1]. (For a survey of some Western work see Collatz 
[1a]; see also Altman [90].) In another direction we recall the develop- 
ments in analytic number theory by Lehmer and Rademacher which 
followed MacMahon’s computations of p(n) for Hardy and Ramanujan 
[2]. We note here the contribution of machines to a problem on re- 
arrangements in real variable theory due to D. H. Lehmer [91], to the 
theory of finite projective geometries and related fields by Hall and his 


* This is a slightly revised and extended version of the article, with the same title, 
which appeared in Comm. Pure Appl. Math., vol. 8, pp. 97-116, 1955, and which was 
reprinted in “Transactions of the Symposium on Computing, Mechanics, Statistics 
and Partial Differential Equations,” F. E. Grubbs, F. J. Murray, and J. J. Stoker, eds., 
Interscience Publishers, Inc., New York, 1955. We are grateful to the publishers for 
permission to reproduce this here. A translation of this article into German, by 
Prof. Dr. E. Kamke, appeared in Jber. Deutsch. Math. Verein., vol. 58, pp. 11-38, 
1955; and a Russian version has appeared in Matematicheskoe prosveshchenie, vol. 1, 
pp. 75-86, 1955, and vol. 2, pp. 97-110, 1956. 
collaborators (see Chap. 15), to a problem of Taussky [122] in the 
theory of sequence spaces by Kato [101], and to complex-function 
theory by Kreyszig and Todd [93] and Kusmina [94]. 

Before proceeding to a discussion of some individual topics in numer- 
ical analysis, some general remarks are in order. We have, on various 
occasions, distinguished between classical and modern numerical analy- 
sis, the latter being material required in connection with the exploitation 
of high-speed automatic digital computing machines. It now seems 
desirable to recognize ultramodern numerical analysis, which may be 
specified as adventures with high-speed automatic digital computing 
machines (see [50,51]). There are, of course, no sharp boundaries 
between these parts of the subject, and there is room for development in 
the classical phases as well as in the newer areas. 

In distinction to the deliberate explorations contemplated in ultra- 
modern numerical analysis, there is much routine work in numerical 
analysis which must necessarily be of an experimental or empirical 
nature. It is just not feasible to carry out rigorous error estimates for all 
problems of significant complication; it is necessary to place considerable 
reliance, on the one hand, on the experience of those familiar with similar 
problems and, on the other, on the good judgment of the setter of the 
problem. To justify this remark, we consider three examples. The 
solution of systems of 20 or more first-order differential equations is being 
handled regularly. To see the complication of theoretical error esti- 
mates [in which the fact that all numbers handled are finite (binary) 
decimals is disregarded], we refer to Bieberbach [3]. The complication 
of a stability analysis in a system of 14 equations is evident from a study 
carried out by Murray [4]. Again, the extent of a complete error esti- 
mate for the problem of matrix inversions is familiar from the work of 
von Neumann and Goldstine [5, 5a] and Turing [6]. Finally, there are 
the analysis of the triple-diagonal method for determining the character- 
istic roots of a symmetric matrix by Givens [7, 7a] and the analysis of the 
Jacobi diagonalization method by Goldstine, Murray, and von Neumann 
[26]. 

What the numerical analyst has to do is to be aware of the precision of 
results obtained from, for instance, the conformal mapping of an ellipse 
on a circle by a certain process and, from these results, to extrapolate to 
cases of regions of comparable shape. On the one hand, he has to 
examine general error analyses for their realism by comparison with cases 
where the explicit, exact results are known. On the other hand, he 
must devote time to the construction and study of bad examples so as to 
counteract any tendency to too much extrapolation. For a preliminary 
discussion of matrix inversion in the last two directions, we refer to 
Newman and Todd [95] and to Todd [76, 77]. 
The main part of this chapter is devoted to a discussion of some topics 
in numerical analysis which appear attractive. These have been chosen, 
among those with which the author is familiar, to point out some of the 
techniques of the subject and to indicate some of the mathematicians who 
have made distinguished contributions in the field. In addition, the 
choice has been controlled by the author’s opinion that separation be- 
tween theoretical and practical numerical analysis is undesirable. The 
practicality of some of the techniques used is illustrated by computations 
of the radiation from a simple source which is reflected from a Lambert 
plane, recently carried out by Henrici [8], where the ideas of Secs. 1.3 
and 1.6 were used. 


1.2 Evaluation of Polynomials 


What is the best way of computing polynomials, for instance, 

$$f(x) = a_0 x^n + a_1 x^{n-1} + \cdots + a_{n-1} x + a_n,$$

for a series of values of x, not equally spaced? (In the case where the 
values of f(x) for a series of equally spaced values of x are required, 
building up f(x) from its differences might be the most convenient.) The 
usual answer is to suggest the recurrence scheme: 

$$s_0 = a_0, \qquad s_{r+1} = x s_r + a_{r+1}, \quad r = 0, 1, \ldots, n - 1,$$


which was known to Newton but is usually ascribed to Horner [9]. In 
this way we get f(x) by n additions and n multiplications. Is this the 
best possible algorithm? Consider an alternative, in the case of 


$$f(x) = 1 + 2x + 3x^2.$$

If we proceed as follows: 

$$2x, \quad x^2, \quad 3x^2, \quad 1 + 2x + 3x^2,$$


we need 3 multiplications and 2 additions compared with the 2 multipli- 
cations and 2 additions needed in applying the above algorithm; thus 


3x, 3x + 2, x(3x + 2), x(3x + 2) +1. 
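
The recurrence scheme can be sketched in a few lines of Python. This is only an illustration: the language, the function name `horner`, and the convention that coefficients are listed with the leading coefficient first are choices made here, not part of the original account.

```python
def horner(coeffs, x):
    """Evaluate a_0*x^n + a_1*x^(n-1) + ... + a_n by the recurrence
    s_0 = a_0, s_{r+1} = x*s_r + a_{r+1}: n multiplications, n additions."""
    s = coeffs[0]
    for a in coeffs[1:]:
        s = x * s + a
    return s


# The example 1 + 2x + 3x^2 has coefficients (3, 2, 1), leading term first.
print(horner([3, 2, 1], 2.0))  # 3*4 + 2*2 + 1 = 17.0
```

For the quadratic above the loop performs exactly the steps 3x, 3x + 2, x(3x + 2), x(3x + 2) + 1.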


This problem was formulated as one in abstract algebra by Ostrowski, 
and he showed [9] that the above algorithm was indeed the best for poly- 
nomials of degree not exceeding 4. A different approach was made 
recently by Motzkin [10] (see also Belaga [98]). Not restricting him- 
self to purely rational processes, he showed that algorithms which are 
more economical in practice can be obtained for larger n, when a 
sufficiently large number of values of f(x) are required. We give a simple 
example in the case n = 6. Consider the evaluation of 


$$P = x^6 + A x^5 + B x^4 + C x^3 + D x^2 + E x + F.$$

Introduce the following polynomials: 

$$P_1 = x^2 + ax,$$

$$P_2 = (P_1 + x + b)(P_1 + c),$$

$$P_3 = (P_2 + d)(P_1 + e);$$

and determine a, b, c, d, e, and f by identifying P and $P_3 + f$. This can 
be done by the solution of linear equations and a single quadratic. This 
evaluation is done once for all, and then P can be evaluated at the ex- 
pense of three multiplications only, with a significant economy over the 
other process if we have to evaluate P for a sufficiently large number of 
values of x. 


The details of the evaluations are as follows. The result of equating 
coefficients in P and $P_3 + f$ is 

$$3a + 1 = A, \tag{1.1}$$

$$3a^2 + 2a + b + c + e = B, \tag{1.2}$$

$$a^3 + a^2 + 2ab + 2ac + 2ae + c + e = C, \tag{1.3}$$

$$a^2 b + a^2 c + a^2 e + ac + ae + bc + be + ce + d = D, \tag{1.4}$$

$$abc + abe + ace + ad + ce = E, \tag{1.5}$$

$$bce + de + f = F. \tag{1.6}$$


From (1.1) we find a. Hence we can rewrite (1.2) and (1.3) in the 
form 

$$b + c + e = B', \tag{1.2'}$$

$$2a(b + c + e) + c + e = C'. \tag{1.3'}$$

(We use primed capitals to indicate new known constants.) Using (1.2′) 
in (1.3′), we get 

$$c + e = C'', \tag{1.3''}$$

which, with (1.2′), gives us b explicitly. Using a, b, c + e, we can write 
(1.4) as 

$$d + ce = D'. \tag{1.4'}$$

Using a, b, c + e, d + ce in (1.5), we find 

$$ce = E', \tag{1.5'}$$

which gives d from (1.4′). From (1.3″) and (1.5′) we can find c, e by 
solving a quadratic equation, and then from (1.6) we can find f. 
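
The elimination just described can be followed step by step in a short Python sketch. The function names, the use of floating-point arithmetic, and the decision to raise an error when the quadratic for c and e has no real roots are assumptions made for this illustration; the text itself does not discuss that case.

```python
import math


def motzkin_parameters(A, B, C, D, E, F):
    """Solve (1.1)-(1.6) for a, b, c, d, e, f; done once per polynomial."""
    a = (A - 1.0) / 3.0                       # (1.1)
    B1 = B - 3.0 * a * a - 2.0 * a            # b + c + e, from (1.2)
    C2 = (C - a ** 3 - a * a) - 2.0 * a * B1  # c + e, from (1.3)
    b = B1 - C2
    D1 = D - a * a * B1 - (a + b) * C2        # d + ce, from (1.4)
    E1 = E - a * b * C2 - a * D1              # ce, from (1.5)
    d = D1 - E1
    disc = C2 * C2 - 4.0 * E1                 # c, e are roots of t^2 - C2*t + E1
    if disc < 0:
        raise ValueError("c and e are not real for these coefficients")
    c = (C2 + math.sqrt(disc)) / 2.0
    e = C2 - c
    f = F - e * (b * c + d)                   # (1.6)
    return a, b, c, d, e, f


def motzkin_eval(params, x):
    """Evaluate the sextic with three multiplications."""
    a, b, c, d, e, f = params
    p1 = x * (x + a)                 # multiplication 1
    p2 = (p1 + x + b) * (p1 + c)     # multiplication 2
    p3 = (p2 + d) * (p1 + e)         # multiplication 3
    return p3 + f


params = motzkin_parameters(4, 9, 12, 11, 2, -5)
print(params, motzkin_eval(params, 0.5))
```

The parameters are computed once; each later evaluation costs three multiplications, against six for the recurrence scheme.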
1.3 Increasing the Speed of Convergence of Sequences 


The construction of processes which increase the speed of convergence 
of sequences and series has been a favorite topic for many numerical 
analysts. For instance, there is the $h^2$ extrapolation process of 
Richardson [11], the converging-factor method of Airey [12, 12a], the Euler 
summation process [13], and a whole subject associated with the name 
of Chebyshev (see, for example, Chap. 3 and [14]). We shall discuss 
the $\delta^2$ process which has been popularized in numerical analysis by 
Aitken [15, e.g.]; it dates back at least to Kummer [16]. 

If 

$$x_n \to x \quad \text{and} \quad x_n - x = A\lambda^n, \quad |\lambda| < 1, \tag{1.7}$$

then 

$$\frac{x_{n+2} - x}{x_{n+1} - x} = \frac{x_{n+1} - x}{x_n - x}. \tag{1.8}$$

From (1.8) we find 

$$x = x_{n+2} - \frac{(x_{n+2} - x_{n+1})^2}{x_{n+2} - 2x_{n+1} + x_n}.$$

This suggests that the sequence $\{\bar{x}_{n+2}\}$ defined by 

$$\bar{x}_{n+2} = x_{n+2} - \frac{(x_{n+2} - x_{n+1})^2}{x_{n+2} - 2x_{n+1} + x_n}$$

converges more rapidly to x than the original sequence. This is indeed 
the case; for if 

$$x_n - x = A\lambda^n + o(\lambda^n), \quad |\lambda| < 1,$$

then it follows that 

$$\bar{x}_n - x = o(\lambda^n).$$


Several remarks are in order. First, this process can be iterated to 
remove successively components in the remainder of the form 

$$A\lambda^n, \quad B\mu^n, \quad C\nu^n, \quad \ldots,$$

where $1 > |\lambda| > |\mu| > |\nu| > \cdots$. The cases in which there are 
equalities such as $|\lambda| = |\mu|$ can be handled by simple modifications. 
Second, it is important to note that this process can make things worse if 
the convergence is not geometric as required by (1.7). Here is a simple 
example involving two of the standard iterative processes for determining 
the reciprocal of a number N. Consider the sequences 

$$y_{n+1} = (1 - N) y_n + 1, \qquad z_{n+1} = z_n(2 - N z_n).$$
In the case $N = \tfrac{1}{2}$ with $y_0 = 1$, $z_0 = 1$, we obtain the following table, in which the 
first and second differences are set between and beside the entries: 

  y_n      Δ       Δ²     ȳ_n        z_n       Δ        Δ²      z̄_n

  1                                  1
          .5                                  .5
  1.5            -.25                1.5              -.125
          .25                                 .375
  1.75           -.125    2          1.875            -.2578   3.0000
          .125                                .1172
  1.875                   2          1.9922                    2.0455


The sequence $\{\bar{y}_n\}$ appears to converge more rapidly than $\{y_n\}$ while the 
sequence $\{\bar{z}_n\}$ appears to converge less rapidly than $\{z_n\}$. These results 
can be easily established. First of all, each sequence converges to $N^{-1}$ if 
$0 < N < 1$ for 

$$y_n - N^{-1} = (1 - N)^n (y_0 - N^{-1})$$

and 

$$z_n - N^{-1} = -N^{2^n - 1}(z_0 - N^{-1})^{2^n}.$$

Thus $\{y_n\}$ satisfies the condition (1.7), while $\{z_n\}$ does not, converging too 
rapidly. In the present case we have $\bar{y}_n = N^{-1}$. On the other hand, it 
can be shown that 

$$\left| \frac{\bar{z}_n - N^{-1}}{z_n - N^{-1}} \right| \to \infty.$$

Note, however, that to justify the application of this process it is sufficient 
to show the existence of an expansion of the form (1.7), with $|\lambda| < 1$. 

Extensive use of this process was made in experiments in conformal 

mapping by Blanch and Jackson [17] and by Todd and Warschawski 
[18]. For instance, in the latter, the mapping of an ellipse (of axis ratio 
5:1) on a circle, it was found that about 50 iterations, each requiring 
about 30 minutes of computing on SEAC, were required to secure 
directly about 9 correct decimals in the value of the boundary function. 
It was, however, possible to obtain the same accuracy by a double use of 
the Aitken process on the first 14 iterants—the extra time required for 
this being negligible. 
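
The behavior of the two reciprocal iterations under the process can be checked with a small Python sketch. The function name `aitken` and the list-based interface are assumptions made for this illustration; no guard is included for a vanishing second difference.

```python
def aitken(seq):
    """Apply the delta-squared process once: from x_n, x_{n+1}, x_{n+2} form
    x_{n+2} - (x_{n+2} - x_{n+1})^2 / (x_{n+2} - 2*x_{n+1} + x_n)."""
    out = []
    for x0, x1, x2 in zip(seq, seq[1:], seq[2:]):
        denom = x2 - 2.0 * x1 + x0
        out.append(x2 - (x2 - x1) ** 2 / denom)
    return out


N = 0.5
y, z = [1.0], [1.0]
for _ in range(3):
    y.append((1.0 - N) * y[-1] + 1.0)     # geometric convergence to 1/N
    z.append(z[-1] * (2.0 - N * z[-1]))   # quadratic convergence to 1/N

print(y, aitken(y))  # the transformed values equal 1/N = 2
print(z, aitken(z))  # the transformed values are worse than the z_n themselves
```

For the geometrically convergent sequence a single application recovers the limit; for the quadratically convergent one the transformed values lie further from 2 than the iterates they were built from.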


1.4 Modified Differences 


We shall show here how the use of quadratic interpolation enables the 
tablemaker to cut down on the size of a table at the expense of some work 
by the table user. We shall then show how a further saving in space can 
be accomplished, at no further expense to the user but at some to the 
tablemaker, by the use of modified differences. For simplicity and 
definiteness, we consider the construction of a table of sin x, to 4 places of 
decimals, for x in the range $(0, \tfrac{1}{2}\pi)$. 
(a) Linear Interpolation 

The error involved in linear interpolation, that is, the assumption that 

$$f(a + ph) = f(a) + p[f(a + h) - f(a)], \quad 0 < p < 1,$$

can be estimated as 

$$\frac{h^2}{8} \max |f''(x)|.$$

If this is to be less than $\tfrac{1}{2} \times 10^{-4}$ an appropriate choice for h is .02. 
This requires a table of some 80 entries, part of which is shown below: 


    x      sin x
   .00     .0000
   .02     .0200
   .04     .0400
   .06     .0600
   .08     .0799
   ...      ...
  1.20     .9320
  1.22     .9391
  1.24     .9458
  1.26     .9521
  1.28     .9580


Interpolation, say for x = 1.234, in this table is carried out as follows: 

$$\sin 1.234 = .9391 + \tfrac{14}{20}(.9458 - .9391) = .9438.$$

(b) Cubic Interpolation 

We now consider using the Everett interpolation formula 

$$f_p = q f_0 + p f_1 + E_2 \delta^2 f_0 + F_2 \delta^2 f_1 + E_4 \delta^4 f_0 + F_4 \delta^4 f_1 + \cdots \tag{1.9}$$

where $f_p = f(a + ph)$ and 

$$q = 1 - p, \quad E_2 = q(q^2 - 1)/6, \quad F_2 = p(p^2 - 1)/6.$$


If we retain the first four terms the truncation error can be estimated 
as 

$$\frac{3h^4}{128} \max |f^{(4)}(x)| < h^4 \times .024 \times 1.$$
For this to be less than $\tfrac{1}{2} \times 10^{-4}$ we can conveniently take h = .2. The 
corresponding complete table is given below. 


    x      sin x     δ²
   .0      .0000       0
   .2      .1987     -80
   .4      .3894    -155
   .6      .5646    -224
   .8      .7174    -287
  1.0      .8415    -336
  1.2      .9320    -371
  1.4      .9854    -392
  1.6      .9996    -400
  1.8      .9738    -387


For interpolation, we now have either to compute the Everett 
coefficients or to obtain them from a table; we find 

$$p = .17, \quad E_2 = -.0430, \quad F_2 = -.0275.$$

We then have 

$$\sin 1.234 = .9320 + \tfrac{17}{100}(.9854 - .9320) + (.0430)(.0371) + (.0275)(.0392) = .9438.$$
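
The interpolation with second differences can be repeated mechanically. In the Python sketch below the helper `table`, which rounds sin x to four decimals in place of the printed table, and the function name `everett2` are assumptions introduced for the illustration.

```python
import math


def table(x):
    """Four-decimal entry, standing in for the printed table of sin x."""
    return round(math.sin(x), 4)


def everett2(x, a, h, f=table):
    """Everett interpolation between a and a + h using second differences only."""
    p = (x - a) / h
    q = 1.0 - p
    e2 = q * (q * q - 1.0) / 6.0
    f2 = p * (p * p - 1.0) / 6.0
    d2_0 = f(a + h) - 2.0 * f(a) + f(a - h)          # second difference at a
    d2_1 = f(a + 2.0 * h) - 2.0 * f(a + h) + f(a)    # second difference at a + h
    return q * f(a) + p * f(a + h) + e2 * d2_0 + f2 * d2_1


# Interpolate at 1.234 in the table with interval .2; the text obtains .9438.
print(everett2(1.234, 1.2, 0.2), math.sin(1.234))
```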


(c) Comrie’s Throwback 

This device, introduced by Comrie [19], depends essentially on the 
fact that the ratio 

$$\frac{F_4}{F_2} = \frac{(p + 2)(p - 2)}{20}$$

is approximately constant for $0 < p < 1$. Various ways of choosing a 
mean value for this have been discussed [20]. The preferred value is 
k = -.18393. With this value of k we rewrite the first four terms of 
(1.9) as 

$$f_p = q f_0 + p f_1 + [E_2(\delta^2 f_0 + k\delta^4 f_0) + F_2(\delta^2 f_1 + k\delta^4 f_1)]. \tag{1.10}$$

Therefore, if we define 

$$\delta_m^2 f = \delta^2 f + k\delta^4 f$$

and use these modified second differences in exactly the same way as we 
used the ordinary second differences in the preceding subsection, we can 


obtain the desired accuracy of interpolation with a much larger interval. 
In fact the error in (1.10) is made up of the truncation now bounded by 

$$\frac{5h^6}{1024} \max |f^{(6)}(x)| < h^6 \times .0049 < \tfrac{1}{2} \times 10^{-4}, \tag{1.11}$$
together with the error caused by the modification. It can be shown 
that the latter is less than half a unit in the last place if the fourth 
differences are less than 1000 units (and the fifth less than 70 units). 
The condition (1.11) gives h < .46, which suggests that h = .5 might 
be acceptable. For this value of h, a bound for the fourth difference is 

$$(.5)^4 \times \max |f^{(4)}(x)| \le .0625$$

which is acceptable, although the bound for the fifth is not. Nevertheless 
we shall use h = .5 without carrying out a more precise estimate. 
The complete table is given below: 


    x      sin x    $\delta_m^2$
    0      .0000        0
   .5      .4794    -1225
  1.0      .8415    -2154
  1.5      .9975    -2552
  2.0      .9093    -2326


For interpolation, we first find the Everett coefficients 

$$p = .468, \quad E_2 = -.0636, \quad F_2 = -.0609.$$

We then have 

$$\sin 1.234 = .8415 + \tfrac{468}{1000}(.9975 - .8415) + (.0636 \times .2154) + (.0609 \times .2552) = .9145 + .0137 + .0155 = .9437.$$

The discrepancy between the results can be explained either by the 
marginal choice of h or by rounding errors. 
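
The throwback can be tried in the same way. In this Python sketch the helper `table` (sin x rounded to four decimals, including arguments outside the tabulated range where the fourth difference needs them) and the function names are assumptions made for the illustration; the constant is the value of k quoted above.

```python
import math

K = -0.18393  # throwback constant quoted in the text


def table(x):
    """Four-decimal entry, standing in for the printed table of sin x."""
    return round(math.sin(x), 4)


def modified_d2(x, h, f=table, k=K):
    """Modified second difference: second difference plus k times the fourth."""
    d2 = f(x + h) - 2.0 * f(x) + f(x - h)
    d4 = (f(x + 2 * h) - 4.0 * f(x + h) + 6.0 * f(x)
          - 4.0 * f(x - h) + f(x - 2 * h))
    return d2 + k * d4


def everett_throwback(x, a, h, f=table):
    """Everett interpolation using modified second differences, as in (1.10)."""
    p = (x - a) / h
    q = 1.0 - p
    e2 = q * (q * q - 1.0) / 6.0
    f2 = p * (p * p - 1.0) / 6.0
    return (q * f(a) + p * f(a + h)
            + e2 * modified_d2(a, h, f) + f2 * modified_d2(a + h, h, f))


print([round(1e4 * modified_d2(x, 0.5)) for x in (0.5, 1.0, 1.5, 2.0)])
print(everett_throwback(1.234, 1.0, 0.5), math.sin(1.234))
```

Forming the modified differences from the rounded entries in this way gives the same units as the third column of the table with interval .5.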

More elaborate types of throwback—for example, of the sixth differ- 
ence as well as of the fourth—were also given by Comrie. In the past 
few years a unified account of theory of the throwback was developed, 
and its relation with expansions in Chebyshev polynomials was estab- 
lished. This is discussed briefly in [21] and in detail by Fox [96]. 
Some minor disadvantages of modified differences have been discussed 


by Comrie [19]. 


1.5 Characteristic Roots of Finite Matrices 


Considerable effort has been expended in problems of numerical 
analysis involving matrices. Bibliographical material is available in 
[22, 72] and in Householder [103]. The two main problems are the 
inversion of matrices and the determination of their characteristic values. 
In both problems the practical determination of bounds for characteristic 
roots is important (see Chap. 8). In this connection we call attention 
here to the following lemma of Geršgorin [24], which has many 
applications: 

All the characteristic roots of $A = (a_{ij})$ lie in the union of the circular regions 

$$|a_{ii} - z| \le \sum_{j \ne i} |a_{ij}|, \quad i = 1, 2, \ldots, n.$$

This is proved by use of the fact that a determinant with dominant main 
diagonal does not vanish. This last result has been generalized by many 
writers; for an account of some of the work, see Taussky [25] and Parodi 
[111]. 
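
The lemma is easy to try on a small case. The Python sketch below uses a 2 × 2 symmetric matrix chosen here purely for illustration; the function names are likewise assumptions, and the characteristic roots are obtained from the quadratic formula rather than from a library routine.

```python
import math


def gershgorin_discs(A):
    """Return (center, radius) for each row of the square matrix A."""
    return [(row[i], sum(abs(v) for j, v in enumerate(row) if j != i))
            for i, row in enumerate(A)]


def in_union(z, discs):
    """True if z lies in at least one of the circular regions."""
    return any(abs(z - c) <= r for c, r in discs)


A = [[4.0, 1.0], [1.0, 3.0]]
discs = gershgorin_discs(A)
# Characteristic roots of the 2x2 matrix from its trace and determinant.
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
roots = [(tr + s * math.sqrt(tr * tr - 4.0 * det)) / 2.0 for s in (1, -1)]
print(discs, roots, all(in_union(z, discs) for z in roots))
```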

One of the preferred and practical methods of getting all the 
characteristic roots of a symmetric matrix depends on the reduction of the 
matrix to pure diagonal forms (Goldstine, Murray, and von Neumann 
[26], Gregory [27]) by superposing orthogonal transformations involving 
two variables at a time. Theoretically we obtain 

$$T A T' = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n),$$

and then $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the exact characteristic values. In practice 
we find 

$$T A T' = (e_{ij}),$$

where the $e_{ij}$ are small for $i \ne j$. We then ask, How near are the $e_{ii}$ to 
the $\lambda_i$? If we disregard the question of the transformation not being 
truly orthogonal (and therefore of the characteristic roots of $(e_{ij})$ not 
being identical with those of $(a_{ij})$), the answer comes at once from the 
lemma. If the $e_{ij}$, $i \ne j$, are sufficiently small, then 

$$|\lambda_i - e_{ii}| \le \sum_{j \ne i} |e_{ij}|, \quad i = 1, 2, \ldots, n.$$

Allowance can easily be made in this inequality for round-off error in 
the product $T A T'$. 


1.6 Quadrature, Integral Equations 


(a) Quadrature 


We begin with an example to show that there is still scope for new 
ideas in classical numerical analysis. A typical quadrature formula is 

$$\int_a^b f(x)\,dx \approx \sum_i \mu_i f(x_i),$$

and the error 

$$E = \left| \int_a^b f(x)\,dx - \sum_i \mu_i f(x_i) \right|$$

is estimated as a multiple of a (high) derivative $f^{(n)}(x)$ of f(x) at a point in 
(a,b). In many cases it is far from convenient to obtain bounds on $f^{(n)}(x)$ 
or to estimate these by computing the corresponding differences 
manually. Recently Davis and Rabinowitz [28, 28a] reconsidered this 
problem in the case when f(x) is analytic in a region including the segment 
(a,b). Eberlein [29] has also contributed to this problem. The case of 
ellipses $\mathcal{E}$ with foci at the end points, which we normalize to (1,0), (-1,0), 
can be handled elegantly, in terms of the Chebyshev polynomials 

$$(1 - z^2)^{-1/2} \sin[(n + 1) \arccos z]$$

which are orthogonal over the area of such an ellipse. It can be shown 
that 

$$|E| \le \sigma_{\mathcal{E}} \|f\|$$

where $\sigma_{\mathcal{E}}$ is a constant depending only on the ellipse $\mathcal{E}$ and where 

$$\|f\|^2 = \iint_{\mathcal{E}} |f(z)|^2 \,dx\,dy.$$

(Note that $\|f\|$ increases as $\mathcal{E}$ expands; however, $\sigma_{\mathcal{E}}$ then decreases, and 
there is a problem of optimal choice of $\mathcal{E}$.) The $\sigma_{\mathcal{E}}$ can be tabulated 
once for all, and $\|f\|$ can be estimated in terms, for example, of max |f|. 
As an example, consider the evaluation of 

$$\int_3^4 \Gamma(x)\,dx$$

using a 7-point Gaussian rule. To evaluate and bound the fourteenth 
derivative of $\Gamma(z)$ seems rather out of the question. Simple estimates 
can be used in the method just described to find 

$$|E| \le 2.04 \times 10^{-12}.$$

A comparison of this estimate with that given by the usual one [30, 31], 

$$\frac{f^{(2n)}(\xi)(n!)^4}{[(2n)!]^3 (2n + 1)}, \quad 3 < \xi < 4,$$

where the derivative is now estimated by the use of Cauchy’s formula, 
shows that the new one is somewhat better. For further developments 
in this vein, see Davis [112]. 
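
A 7-point Gaussian rule for the gamma function over (3, 4) can be tried directly. The Python sketch below is an assumption-laden illustration, not the computation of the authors cited: the nodes are found by Newton's method on the Legendre polynomial, the function names are invented here, and a 12-point rule is used as an informal reference value.

```python
import math


def gauss_legendre(n):
    """Nodes and weights of the n-point Gauss rule on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))   # usual first guess
        for _ in range(50):                              # Newton on P_n(x) = 0
            p0, p1 = 1.0, x
            for k in range(2, n + 1):                    # three-term recurrence
                p0, p1 = p1, ((2 * k - 1) * x * p1 - (k - 1) * p0) / k
            dp = n * (x * p1 - p0) / (x * x - 1.0)       # derivative of P_n
            x -= p1 / dp
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def gauss(f, a, b, n):
    """n-point Gaussian quadrature of f over [a, b]."""
    nodes, weights = gauss_legendre(n)
    half, mid = 0.5 * (b - a), 0.5 * (a + b)
    return half * sum(w * f(mid + half * t) for t, w in zip(nodes, weights))


approx = gauss(math.gamma, 3.0, 4.0, 7)
reference = gauss(math.gamma, 3.0, 4.0, 12)   # higher-order rule as a check
print(approx, abs(approx - reference))
```

In double-precision arithmetic the difference between the two rules is dominated by rounding, which is consistent with an error bound of the order quoted.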


(b) Integral Equations 


Among the basic problems in the numerical analysis of integral 
equations are the relations between the eigenvalues of a (symmetric) kernel 
and those of an approximating matrix. A satisfactory account of this was 
given recently by Wielandt [32] in support of some experiments on con- 
formal mapping [18] which were being carried out on SEAC, the 
National Bureau of Standards Eastern Automatic Computer. The 
continuous problem is the solution of 


$$\int K(x, \xi)\, y(\xi)\, d\xi = \kappa\, y(x).$$

We make this discrete by introducing a quadrature 

$$\int f(\xi)\, d\xi \approx \sum_i \mu_i f(\xi_i) \tag{1.12}$$

and are therefore led to consider the matrix problem 

$$\sum_j K(\xi_i, \xi_j)\, \mu_j\, y_j = \bar{\kappa}\, y_i.$$

What are the relations between the finite number of $\bar{\kappa}$ and the infinity of 
the $\kappa$? We quote a typical result. If we take for (1.12) the trapezoidal 
quadrature 
quadrature 


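$$\int_0^1 f(\xi)\, d\xi \approx \frac{1}{n}\left[ \tfrac{1}{2} f(0) + f\left(\tfrac{1}{n}\right) + \cdots + f\left(\tfrac{n-1}{n}\right) + \tfrac{1}{2} f(1) \right],$$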


then, provided K satisfies 

$$|K(x, y) - K(\alpha, \beta)| \le L(|x - \alpha| + |y - \beta|),$$

where $\alpha$, $\beta$ run through the points $(i/n, j/n)$ and where $|x - \alpha| \le \tfrac{1}{2} n^{-1}$, 
$|y - \beta| \le \tfrac{1}{2} n^{-1}$, we have 


ao panes 


where the constant C = 4 + 2 is best possible. 
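
The passage from a kernel to an approximating matrix can be illustrated numerically. The Python sketch below is an illustration under assumptions made here: the kernel min(x, y) on the unit interval (whose largest eigenvalue is known to be 4/π² and which satisfies the Lipschitz condition above with L = 1), the trapezoidal weights, and a plain power iteration are all choices of this example, not taken from the work cited.

```python
import math


def largest_eigenvalue(kernel, n, iterations=200):
    """Power iteration on the matrix (K(x_i, x_j) * mu_j) obtained from the
    trapezoidal rule on [0, 1] with n subintervals."""
    xs = [i / n for i in range(n + 1)]
    mu = [1.0 / n] * (n + 1)
    mu[0] = mu[-1] = 0.5 / n
    v = [1.0] * (n + 1)
    kappa = 0.0
    for _ in range(iterations):
        w = [sum(kernel(x, y) * m * vj for y, m, vj in zip(xs, mu, v))
             for x in xs]
        kappa = max(abs(t) for t in w)     # eigenvalue estimate (max norm)
        v = [t / kappa for t in w]
    return kappa


approx = largest_eigenvalue(min, 20)
exact = 4.0 / math.pi ** 2   # largest eigenvalue of the kernel min(x, y)
print(approx, exact, abs(approx - exact))
```

Repeating the computation with larger n shows the matrix eigenvalue approaching that of the kernel.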


(c) Convergence and Stability 


Problems of convergence and stability in the numerical solution of 
ordinary and partial differential equations lead to interesting and subtle 
questions. 

Consider first an ordinary differential equation for a function y = y(x). 
One meaning of a numerical solution is a sequence of values $Y_n = Y(x_n)$, 
at a sufficiently dense set $\{x_n\}$, which approximate $y(x_n)$ to within an 
assigned tolerance. [In the case of a characteristic-value problem, we 
also have to ensure that the characteristic values in the discrete problem 
are sufficiently near to those of the continuous problem; see also Sec. 
1.6(b) above.] Various prescriptions for finding such values are 
available. In general, however, these prescriptions are not carried out 
rigorously; calculations are made with rounded numbers. In some 
cases the effects can be catastrophic (see, e.g., Todd [82]). An in- 
dication of a source of difficulty is the following: we may be attempting 
to compute a bounded solution of a differential equation which also has 
an unbounded solution, as when computing $e^{-x}$ in the case of $y'' = y$, which also has the solution $e^{x}$. An 
error may introduce a component of the unbounded solution which will 
soon predominate. 
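
This effect is easily observed. The Python sketch below is an illustration under assumptions made here: the central-difference recurrence, the step h = .1, and the exact starting values are choices of this example. The unwanted component is introduced mainly by the discretization error in the starting values; rounding errors act in the same way.

```python
import math


def march(h, x_end):
    """Central-difference recurrence y_{n+1} = (2 + h^2) y_n - y_{n-1} for
    y'' = y, started from the exact values of exp(-x) at x = 0 and x = h."""
    steps = round(x_end / h)
    prev, cur = 1.0, math.exp(-h)
    for _ in range(steps - 1):
        prev, cur = cur, (2.0 + h * h) * cur - prev
    return cur


for x_end in (1.0, 10.0, 30.0):
    print(x_end, march(0.1, x_end), math.exp(-x_end))
```

Near x = 1 the computed value is still close to the bounded solution; by x = 30 the growing component has taken over completely.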

So, apart from the adequateness of the approximation of $Y_n$ to $y_n$ (the 
problem of convergence), we have to investigate the sensitiveness of the 
numerical process to the limited digital character of our equipment. 
Instead of producing $Y_n$ we actually produce $\bar{Y}_n$; the study of $|Y_n - \bar{Y}_n|$ 
is the problem of “stability.” 

Among those who have contributed to this field are Rutishauser [83], 
Lotkin [89], Dahlquist [118], and Henrici [117]. 

Similar problems, as well as some new features in those just mentioned, 
appear in the study of partial differential equations (see Chap. 11). The 
main new feature is that in many cases restrictions on the shape of the 
mesh are necessary for convergence [85] and for stability [84, 86, 109]. 
The convergence restrictions, which can often be motivated physically, 
are in many cases those which are significant numerically. This con- 
nection has been developed by P. Lax and others and has been discussed 
in the monograph of Richtmyer [108]. 

A final remark, which we do not elaborate, is the following. It is 
tempting to base error analyses on the hypothesis that individual errors 
are random variables. This, of course, is not the case, because the 
errors are completely determined once the details of the computational 
process are settled. However, such assumptions often lead to useful and 
realistic estimates (see, on the one hand, Goldstine and von Neumann 
[5a] and Rademacher [88] and, on the other, Huskey [87]). 


1.7 Game Theory and Related Developments 


In game theory there are problems in which the intuition of a geometer 
can play an essential part; for instance, the theory of polyhedra and 
convex bodies and fixed-point theory are highly relevant. The 
foundations of a theory of games were laid down by von Neumann [33, 39, 74, 
75], beginning in 1928. The theory of two-person zero-sum games is 
well developed; but the practical problem of finding the value of such a 
game and the optimal strategies is difficult, and the solutions available 
so far are not entirely satisfactory. Among related and essentially 
equivalent problems are the solution of systems of linear inequalities, the 
solution of linear programs in the sense of Dantzig, and the Chebyshev
problem of determining the minimax “solution” of an inconsistent
system of linear equations

$$\eta_j = \sum_{i=1}^{n} a_{ji} x_i + b_j = 0, \qquad j = 1, 2, \ldots, m,$$

that is, the set of values $x_i$ which minimizes the maximum of the residuals
$|\eta_j|$ (see, e.g., Stiefel [106]).
Among the methods of attacking these problems are the simplex
method [34], the relaxation method [35, 35a], and the double-description 
method [36]. We shall, however, discuss a very simple example by a 
natural approach due to Brown [37], the validity of which was established 
by Robinson [38]. A related continuous solution of this discrete prob- 
lem has been given by Brown and von Neumann [37a]; we take up this 
idea of continuous approach to discrete problems again in Sec. 1.9(c). 

Consider the following game played between two players, R and C, 
each of whom has two strategies, which we may interpret as the choice of 
a row or a column in the pay-off matrix $P = (p_{ij})$:

$$P = \begin{pmatrix} 1 & 3 \\ 4 & 2 \end{pmatrix}$$

If R chooses the $i$th row and C chooses the $j$th column, then R gets $p_{ij}$
from C. This is manifestly an unfair game, and R should pay to play it.

The value of this game is 2.5, and the optimal strategies are the following:
R should choose 1 and 2 each with probability $\tfrac12$; C should choose 1
with probability $\tfrac14$ and 2 with probability $\tfrac34$. The significance of these
statements is the following: if R plays in this way, his expected gain is not
less than 2.5, whereas if C plays this way, his expected loss is not greater
than 2.5.

To prove this statement is simple. Let R play 1 with probability
$r \ge 0$ and 2 with probability $1 - r \ge 0$; let C play 1 with probability
$c \ge 0$ and 2 with probability $1 - c \ge 0$. Then the expectation of R is

$$E = 1 \times rc + 4(1 - r)c + 3r(1 - c) + 2(1 - r)(1 - c).$$

We have

$$E = -4(r - \tfrac12)(c - \tfrac14) + \tfrac52.$$

This shows that when $r = \tfrac12$ then $E = \tfrac52$ for any $c$ and that when
$r \ne \tfrac12$ then $c$ can be chosen to make $E < \tfrac52$; similarly, if $c = \tfrac14$, then
$E = \tfrac52$, and if $c \ne \tfrac14$, then $r$ can be chosen to make $E > \tfrac52$ (see
McKinsey [39]).

How can we arrive at these results, or approximations to them? 

We shall describe an algorithm for an alternating choice of strategies by 
R and C which can be interpreted as follows: each chooses that strategy 
which is better in comparison with the observed behavior to date of his 
opponent. 

After $n$ plays, suppose C has chosen 1 in $c_1^{(n)} \times n$ plays and 2 in the
remaining $c_2^{(n)} \times n$ plays. (For convenience, from now on, we shall
drop the superscripts.) If C continues this pattern, the expectation of R
in the next play is

$$e_1 = c_1 + 3c_2 \quad \text{if he chooses 1},$$
$$e_2 = 4c_1 + 2c_2 \quad \text{if he chooses 2}.$$

R therefore chooses 1 or 2 according to whether $e_1 > e_2$ or $e_1 < e_2$. Similarly,
if R has chosen 1 in $r_1^{(n)} \times n$ plays and 2 in the remaining $r_2^{(n)} \times n$
plays, then C chooses 1 or 2 according to the expected size of his loss,
which is

$$f_1 = r_1 + 4r_2 \quad \text{if he chooses 1},$$
$$f_2 = 3r_1 + 2r_2 \quad \text{if he chooses 2}.$$

Specifically C chooses 1 if $f_1 < f_2$ and 2 if $f_1 > f_2$.
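The alternating rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: strategies are indexed 0 and 1 instead of 1 and 2, the function name `alternate_choice` is hypothetical, and ties (which the text does not settle) are broken in favour of the first strategy.

```python
# Pay-off matrix of the example: P[i][j] is what R receives from C
# when R plays row i and C plays column j (0-based indices).
P = [[1, 3],
     [4, 2]]


def alternate_choice(n_plays, first_r=0):
    """Run n_plays of the alternate-choice rule; return (r1, c1) frequencies.

    Assumption: a tie is resolved by choosing strategy 0 (the text's "1").
    """
    r_count = [0, 0]  # how often R has used each row
    c_count = [0, 0]  # how often C has used each column
    r = first_r       # arbitrary initial strategy for R
    for _ in range(n_plays):
        r_count[r] += 1
        # C picks the column with the smaller loss against R's record so far.
        f = [sum(r_count[i] * P[i][j] for i in range(2)) for j in range(2)]
        c = 0 if f[0] <= f[1] else 1
        c_count[c] += 1
        # R picks the row with the larger gain against C's record so far.
        e = [sum(c_count[j] * P[i][j] for j in range(2)) for i in range(2)]
        r = 0 if e[0] >= e[1] else 1
    return r_count[0] / n_plays, c_count[0] / n_plays


if __name__ == "__main__":
    print(alternate_choice(20000))
```

With a different tie-breaking convention the individual plays change, so the output of this sketch should not be expected to match any particular printed sequence play for play.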

We have used the word “play” loosely in the preceding paragraph.
We actually describe an algorithm for a choice of strategies by R and C 
alternately. A sequence of strategies is determined after we make an 
arbitrary choice of an initial strategy for R, say 1. The resultant se- 
quence can be combined in pairs to give a sequence of plays in the proper 
sense : 


(sds C2) (252252): (252), (252)5° (252) (252) (152), 2) (152); 
(1,2), (1,2), (1,1), (1,1), (251), (2,2), (2,2), (2,2), (2,2), (1,2), (1,2), (1,2), 
(Lj1)5 ClD)(2,F 1s 251) 5.( 252) 5 (252s 252)542,2)5 (252))(252) (252) 
(2,2), (2,2), (2,2), (2,2), (2,2), (1,2), (1,2), (1,2), (1,2), (1,2), (1,2), (1,2), 
(1,2), (1,2), (1,2), ..-- 


It has been shown [38] that, as $n \to \infty$, the sequences $c_1^{(n)}$, $r_1^{(n)}$ converge
to the optimal strategy, that is, $c_1^{(n)} \to \tfrac14$, $r_1^{(n)} \to \tfrac12$, and that the
average pay-off $p^{(n)}$ converges to the value of the game. In our case

$$r_1^{(50)} = .48, \qquad c_1^{(50)} = .2, \qquad p^{(50)} = 2.4.$$


The structure of the sequence above, consisting of blocks of identical 
elements, is typical; this can obviously be used to speed the computa- 
tions. For some practical experiments in this field, see [40]. Another 
algorithm, the convergence of which has also been established, is the 
following. We begin with the choice of a play, say (1,1). Then future 
plays are determined by the simultaneous choice of strategies by R and C 
according to the previous rule. The sequence of plays now begins 


(1,1), (2,1), (2,1), (2,2), ....


It has been observed that convergence of the above alternate-choice 
algorithm is often faster than that arising from genuine simultaneous 
choice of strategies. 

We shall now discuss an application of the theory of games to the so-called
assignment problem. This problem is to assign $n$ square pegs to $n$
round holes in such a way as to maximize the total goodness of fit. In
other words, $(a_{ij})$ being given, we have to choose a permutation $(i_1, i_2, \ldots, i_n)$ of
$(1, 2, \ldots, n)$ so as to maximize

$$\sum_{j=1}^{n} a_{j i_j} \qquad (1.13)$$
This is trivial theoretically; we have only to find the largest of the $n!$
sums of the form (1.13). In practice, however, this may be out of the
question, and so we may have to settle for some approximation to the
maximum. One way of doing this (suggested by von Neumann [41]) is
to set up an equivalent game-theory problem (it turns out to be a sort of
hide-and-seek) and solve this approximately by the method just discussed.
The first player chooses a pair of indices $(i, j)$ ($1 \le i \le n$,
$1 \le j \le n$); he has $n^2$ strategies. The second then elects first to guess
the first or second of these two indices, and then guesses it by choosing $k$
($1 \le k \le n$); he has $2n$ strategies. In the first case if $k = i$, and in the
second case if $k = j$, the first player pays the second $(a_{ij})^{-1}$; otherwise
there is no pay-off.
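To make concrete why the direct search is "trivial theoretically" yet impractical, here is a minimal brute-force sketch that simply examines all $n!$ permutations. The function name `best_assignment` and the sample matrix are illustrative assumptions, not taken from the text.

```python
from itertools import permutations


def best_assignment(a):
    """Return (permutation, total) maximizing sum of a[j][perm[j]].

    Exhaustive search over all n! permutations; usable only for small n.
    """
    n = len(a)

    def total(perm):
        return sum(a[j][perm[j]] for j in range(n))

    best = max(permutations(range(n)), key=total)
    return best, total(best)


if __name__ == "__main__":
    fit = [[4, 1, 3],
           [2, 0, 5],
           [3, 2, 2]]
    print(best_assignment(fit))
```

Since the work grows like $n!$, a case such as $n = 12$ already requires about 479 million sums, which is why approximate or structured methods are of interest.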

Assignment problems for n = 12 have been handled by this method. 
It has, however, been found that a direct approach which regards the 
assignment problem as a special case of a transportation problem (see 
[42]) has been very successful. One chooses a permutation matrix $(p_{ij})$
such that $\sum_{i,j} p_{ij} a_{ij}$ is minimum, and solves this, for instance, by the
simplex method [34].

An up-to-date account of this problem and its generalizations has been 
given by Motzkin [42]. Among these are the transportation problem, 
the caterer problem, the problem of contract awards, and the traveling- 
salesman problem [43, 44]. Solutions to problems of this type are now 
obtained on a routine basis, on high-speed computers, as an aid to man- 
agement decision in industrial and military situations [45]. Among 
other problems of this general character, which are in the research stage, 
are those concerned with organization theory, which have been studied 
by Marshak and Tompkins [46]. 


1.8 Monte Carlo 


This is a subject with large areas unsoiled by theorems, as can be seen by 
reference to the reports on various symposia held on the subject [67, 68]. 
For instance, during the last four years we have been generating millions
of pseudo-random numbers on SEAC, using such relations as

$$r_n = 2^{-42} x_n, \qquad x_{n+1} = \rho x_n \pmod{2^{42}}, \qquad x_0 = 1, \quad \rho = 5^{17},$$

or

$$r_n = 2^{-44} x_n, \qquad x_{n+1} = x_n + x_{n-1} \pmod{2^{44}}, \qquad x_0 = 0, \quad x_1 = 1.$$

See, for instance, Chap. 4, Taussky and Todd [68, pp. 15-28], and
Todd [81]. The $r_n$ behave as if they came from a uniform distribution
in the interval (0, 1). The results we obtained were satisfactory in
all cases where we had independent checks. We have, however, no
theorems at all about the “randomness” of these sequences or about the
distributions in blocks of the size used in our calculations.
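A multiplicative congruential generator of this kind is easy to sketch. The constants below (multiplier $5^{17}$, modulus $2^{42}$, starting value 1) are assumptions made for the illustration, and the function name is hypothetical; Python integers are exact, so no special word length is needed.

```python
def multiplicative_generator(count, rho=5**17, modulus=2**42, x0=1):
    """Return `count` pseudo-random fractions r_n = x_n / modulus.

    x_{n+1} = rho * x_n (mod modulus); the default constants are
    assumptions chosen for this sketch.
    """
    x = x0
    fractions = []
    for _ in range(count):
        x = (rho * x) % modulus
        fractions.append(x / modulus)
    return fractions


if __name__ == "__main__":
    sample = multiplicative_generator(10000)
    # A crude check only: the sample mean should be near 1/2.
    print(sum(sample) / len(sample))
```

A sample mean near 1/2 is of course no proof of randomness; it is exactly the kind of empirical check, without supporting theorems, that the text describes.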
We mention here also the quasi-Monte Carlo processes studied by
Peck and Richtmyer [47, 48]. Here high-power algebraic number 
theory is used to evaluate the error committed by replacing integrals by 
sums of the integrands at points determined by certain algebraic num- 
bers. See also Halton [92] and Hammersley [120]. 


1.9 Recent Activity in Numerical Analysis 


A few areas are mentioned here with which the author is familiar and 
which he has found interesting. This personal selection omits reference 
to many areas in which there have been important advances (e.g., 
meteorology) and to areas which have been discussed elsewhere in this 
volume. 


(a) Ultramodern Numerical Analysis 

One class of experiments may be described as follows. It has been 
usual in discussing properties of matter to regard the medium as contin- 
uous, to set up differential equations, look at them for a while, give up, 
and replace them by difference equations. These difference equations 
are then solved, and no attention is paid to their physical significance, 
if any. 

An alternative approach is to handle the problem discretely from 
the beginning, lumping the “molecules” together in groups as small as 
the computing equipment can handle. 

Among those who have handled problems in this general way are
Seeger, von Neumann, and Polachek (see Seeger [49]), who were concerned
with shock-wave phenomena. Pasta and Ulam [50] have studied
the mixing of fluids and the motions of star clusters in this way. Metropolis
and Fermi [51] have investigated the equations of state of individually
interacting particles forming an idealized liquid. Fröberg [78]
has studied a model of a photographic emulsion.


(b) Biological Applications 

There has been pioneering work by Turing [52] on the problem of 
morphogenesis. Turing constructs a mathematical model of a growing 
embryo and shows how well-known physical laws are sufficient to 
account for many of the facts about the development of its anatomical 
structure.

Another application has been the study of the reaction of nerve fibers 
to electric stimuli. These phenomena are governed by a system of four 
nonlinear ordinary differential equations (Hodgkin-Huxley). The
system has been studied by Antosiewicz, Cole, FitzHugh, and Rabinowitz,
and, in particular, the threshold value of the input current has been
determined. The agreement with the results of many experiments
indicates the reliability of the model and encourages further investigation
(see [53]).

Barricelli has studied numerical analogues of genetic and evolutionary
processes. 


(c) Combinatorial Analysis 


Combinatorial analysis is an obvious source of problems. There have 
been recent reports on this topic by Cairns [54] and Tompkins [55]. 
The numerical analyst, however, soon finds himself out of his depth if he 
uses straightforward approaches. 

One new idea which was tried is that of a continuous approach to 
discrete problems, in particular to the search for perfect difference sets. 
A perfect difference set is a set of $n + 1$ integers whose $n(n + 1)$ differences
take on all nonzero values mod $n^2 + n + 1$. For example, the
differences of 1, 2, 4 are $\pm 1$, $\pm 2$, $\pm 3$; that is, all nonzero values mod 7,
and so 1, 2, 4 form a perfect difference set mod 7.
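The definition can be checked mechanically. The following sketch (the function name is an assumption for illustration) forms all $n(n+1)$ ordered differences and compares them with the nonzero residues.

```python
def is_perfect_difference_set(members, n):
    """True if `members` has n + 1 elements whose n(n + 1) ordered
    differences are exactly the nonzero residues mod n^2 + n + 1."""
    modulus = n * n + n + 1
    members = list(members)
    if len(members) != n + 1:
        return False
    differences = sorted((a - b) % modulus
                         for a in members for b in members if a != b)
    return differences == list(range(1, modulus))


if __name__ == "__main__":
    print(is_perfect_difference_set([1, 2, 4], 2))      # the example mod 7
    print(is_perfect_difference_set([0, 1, 3, 9], 3))   # a set mod 13
```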

A perfect difference set $Y$ can be specified by $N = n^2 + n + 1$ constants
$x_r$, where $x_r = 1$ if $r \in Y$, $x_r = 0$ otherwise. In this case we have

$$y_0 = \sum_r x_r^2 = n + 1,$$

$$y_s = \sum_r x_r x_{r+s} = 1, \qquad s = 1, 2, \ldots, N - 1$$

(the subscript $r + s$ is to be understood mod $N$); hence

$$\sum_{s=0}^{N-1} y_s = (n + 1)^2. \qquad (1.14)$$

It follows, therefore, that such a set $x_r$ minimizes

$$J = (n + 1) \sum_{s=0}^{N-1} y_s^2 - n y_0^2,$$

for, in view of (1.14), $J$ differs by a constant from

$$[y_0 - (n + 1)]^2 + (n + 1) \sum_{s=1}^{N-1} (y_s - 1)^2.$$

This suggests an attempt to obtain a set of $x_r$ by minimizing $J$, now regarded
as a function of the $N$ continuous real variables $x_r$, subject to
(1.14) (and perhaps to other relations such as $0 \le x_r \le 1$). Such an
attempt was made on SWAC, the National Bureau of Standards Western
Automatic Computer, by a steepest-descent process. Although admissible
values of $y$ were obtained rapidly, the corresponding values of $x$
were not integers. See also Chap. 15 and [115].


(d) Number Theory and Algebra 


The subjects of number theory and algebra are natural sources of 
problems, and there have been many applications of high-speed com- 
puters in these areas, particularly in elementary, algebraic, and analytic 
number theory, as well as some in algebra proper [55a, 56]. 
Recent work on SWAC, mainly on elementary number theory by D.
H. Lehmer, E. Lehmer, and their collaborators, has been discussed by 
E. Lehmer [57]. 

Among other work has been a study of the divisibility of $[(p - 1)! + 1]/p$
by $p$. This was known to be the case for $p = 5$, $p = 13$; Goldberg
[58] found that it was also the case for $p = 563$ and for no other
$p < 10000$.
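The search is simple to express with exact integer arithmetic: $[(p-1)! + 1]/p$ is divisible by $p$ exactly when $(p-1)! + 1$ is divisible by $p^2$. The sketch below (function name assumed, trial division used only for brevity) reduces the factorial modulo $p^2$ as it goes.

```python
def primes_with_divisible_wilson_quotient(limit):
    """Primes p < limit for which (p - 1)! + 1 is divisible by p^2."""
    found = []
    for p in range(2, limit):
        # Trial division is adequate for a small illustrative range.
        if any(p % d == 0 for d in range(2, int(p ** 0.5) + 1)):
            continue
        square = p * p
        factorial = 1
        for k in range(2, p):
            factorial = factorial * k % square  # (p - 1)! reduced mod p^2
        if (factorial + 1) % square == 0:
            found.append(p)
    return found


if __name__ == "__main__":
    print(primes_with_divisible_wilson_quotient(600))
```

Extending the range to 10000 as in the text only requires a larger `limit`; the running time grows roughly with the square of the limit.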

Problems in algebraic number theory are more complicated to handle. 
A survey of computational problems in this field has been given by 
Taussky [59]. Since then there has been work by Cohn and Gorn on 
units in cubic fields (see Cohn [60]). 

There have been various attempts to study the zeros of the Riemann 
zeta function; among those is the work of Turing [61]. 

The studies in algebra proper include the work of Paige and Tompkins 
[55a] on the systematic generation of permutations, with applications to 
group theory, and that of Goldberg [56] on the Baker-Campbell-Hausdorff
formula. Forsythe [80] has enumerated all the 126 semigroups of
order 4. The characters of symmetric groups have been investigated by
Bivins and others [79] and by Comét [107]. 


(e) Topology 

It is clear that approximate computations of quantities known to be
integers serve to define them if the absolute value of the error is known to
be less than $\tfrac12$. This is used in the work on $p(n)$ mentioned earlier.
Pasta and Ulam [50] have suggested that further applications can be
made in an essentially topological problem, for example, the structure
of the lines of force caused by current in two infinite straight wires which
are skew.

A simple application of this is to the location of the zeros of a polynomial
$P(z)$. We use the fact that

$$n = \frac{1}{2\pi i} \int_C \frac{P'(z)}{P(z)}\,dz$$

where $n$ is the number of zeros inside the simple closed rectifiable curve
$C$. It is possible to choose $C$ to be a square so large as to contain all the
zeros of $P(z)$ and then, by process of quadrisection, to locate the zeros
approximately. The quadrature must be accomplished with absolute
error less than $\tfrac12$; if this proves difficult because of the vanishing or near
vanishing of $P(z)$ on the boundary, then we know that we are in the
neighborhood of a zero and can act accordingly. A constructive proof
of the fundamental theorem of algebra along these lines has been given
by Rosenbloom [62]. See also Tompkins [104].
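The counting step can be sketched as follows. This is an illustration only: the quadrature is a plain composite midpoint rule on each side of the square, the names `count_zeros` and `poly_and_derivative` are assumptions, and no attempt is made to detect a zero lying on or very near the boundary.

```python
import math


def poly_and_derivative(coeffs, z):
    """Horner evaluation of P(z) and P'(z); coeffs run from highest degree."""
    p, dp = 0j, 0j
    for c in coeffs:
        dp = dp * z + p
        p = p * z + c
    return p, dp


def count_zeros(coeffs, center, half_width, steps=2000):
    """Estimate the number of zeros of P inside a square by integrating
    P'/P counter-clockwise around its boundary (midpoint rule)."""
    corners = [center + half_width * complex(-1, -1),
               center + half_width * complex(1, -1),
               center + half_width * complex(1, 1),
               center + half_width * complex(-1, 1)]
    integral = 0j
    for k in range(4):
        start, end = corners[k], corners[(k + 1) % 4]
        dz = (end - start) / steps
        for j in range(steps):
            p, dp = poly_and_derivative(coeffs, start + (j + 0.5) * dz)
            integral += dp / p * dz
    # The exact value is an integer, so rounding is safe once the
    # quadrature error is below 1/2.
    return round((integral / (2j * math.pi)).real)


if __name__ == "__main__":
    cube_roots = [1, 0, 0, -1]                 # P(z) = z^3 - 1
    print(count_zeros(cube_roots, 0, 2.0))     # large square
    print(count_zeros(cube_roots, 1, 1.0))     # one quadrant-sized square
```

Repeating the count on the four quarters of a square that contains zeros is the quadrisection process described above.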
1.10 Theory of Machines or Automata


Among those who have contributed to basic research in machine 
theory have been Turing [63], Shannon [64], and von Neumann [65]. 
There have been some efforts of a supporting-research character: the use 
of machines to design circuits for better machines, the design of self- 
correcting codes, and improvements in the use of machines—for example, 
more automatic coding. Much of this belongs more to the domain of 
logicians than to that of the numerical analyst. For recent work in 
these areas, see, for example, [99, 100, 102, 105]. A bibliography is 
given by Carr [116]. 
2.1 General Introduction


This chapter provides a survey of some of the techniques of classical 
numerical analysis, by which we mean, approximately, computing with 
desk machines. Even those who have the freest access to automatic 
computers and are most adept in their use find desk computers indispen- 
sable in various phases of numerical analysis—for instance, in pilot cal- 
culations, in checking, and in the preliminary (or final) analyses of 
results prepared by more powerful equipment. Further, a spell of
many-decimal calculation on a desk machine is a helpful approach to the use of
automatic computers, on which mistakes will be more costly, if less pain- 
ful, than on desk equipment. 

Ideally, this chapter should include methods appropriate for desk 
machines for handling all the various topics to which succeeding chapters 
are devoted. It is, however, restricted to the following topics: inter- 
polation, quadrature and differentiation, ordinary differential equations, 
miscellaneous devices, and tables. 

Although our main concern here is with desk computing, we do not 
hesitate to digress for elementary discussions of points in automatic com- 
putation where this seems advisable. 

This chapter should be read with a desk calculator and various stand- 
ard tables at hand. The reader can easily vary the worked examples 
to provide exercises which can be checked by reference to standard 
tables. This will, in addition, induce an acquaintance with tables. 
Among the particularly suitable tables are the tables of Bessel functions 
(BAAS 6, 10) and those of the Airy integral (BAAS 12) which were pre- 
pared by the Mathematical Tables Committee of the British Association 
for the Advancement of Science. There is much valuable advice in the
introductions to other standard tables, some of which are mentioned in
Secs. 2.37 to 2.42; references such as BAAS 12 are explained in these
sections. 

A convenient collection of tables and formulas is “Interpolation and 
Allied Tables,” H.M. Stationery Office, London, 1956. We shall refer
to this as IAT. 

The material in this chapter is covered in detail in many well-known 
textbooks, some of which are listed in Sec. 2.43. We refer to some of 
these for amplification of our account at various points. 


INTERPOLATION 


2.2 Introduction 


It is not often that one finds the exact information one requires directly 
in a table; it is usually necessary to interpolate, or to read between the 
lines. Whether one can do this at all and, in case one can, the accuracy
of the interpolation process depends on the “regularity,” or “smoothness,”
of the function under consideration, near the point in question.
The meaning of the words in quotation marks is indicated by the form of 
the error estimates given below. 

Consider, for instance, the following table:


x       5  10  15  20  25  30  35  40  45  50  55
f(x)    5   5   5   5   5   5   7   5   5   5  11


This is of no great help in determining f(x) for other (integral) values of
x. (This is a table of the greatest prime factor of x.) On the other
hand, a table like


x       0   1   2   3   4   5
g(x)    0   3  12  27  49  76


lends itself to interpolation for any value of x when we notice the comparative
regularity of the growth of g(x). (This is a table of the integral
part of $10^4 \sin^2 x^\circ$.)


2.3 Special Methods 


We begin with an account of processes in which some use is made of 
the particular form of the function—for instance, an addition theorem it 
satisfies, or its power series expansion. We give several examples. 
To compute exp 1.23456 to 6D (6 decimal places) having a 6D table
of exp x at an interval of .001 in x, we can proceed as follows:

$$\exp 1.23456 = \exp 1.234 \times \exp .00056 \doteq 3.434942(1 + .00056) \doteq 3.436866$$

or, as a check,

$$\exp 1.23456 = \exp 1.235 \times \exp(-.00044) \doteq 3.438379(1 - .00044) \doteq 3.436866.$$

Here and elsewhere we use the symbol $\doteq$ to indicate approximate
equality in a loose sense. It is possible to give rigorous error estimates
in this case, but this is rather exceptional. The first use of $\doteq$ covers the
error in the tabular value of exp 1.234 and the error caused by the truncation
of the exponential series, replacing $e^x$ by $1 + x$. The second use
of $\doteq$ covers the rounding off of the product. It is clear that the relative
error is at most about $\tfrac12 (\tfrac12 \cdot 10^{-3})^2$, if we compute from the nearer
tabular value.
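The procedure can be imitated in a short sketch in which a 6D table is simulated by rounding `math.exp` to six decimals. The function name and the use of the nearer tabular argument are assumptions made for the illustration.

```python
import math


def exp_from_table(x, step=0.001):
    """Interpolate exp x from a simulated 6D table with the given step,
    using exp(x0 + d) ~ exp(x0) * (1 + d) from the nearer entry x0."""
    x0 = round(x / step) * step            # nearer tabular argument
    tabulated = round(math.exp(x0), 6)     # stands in for the table entry
    return round(tabulated * (1 + (x - x0)), 6)


if __name__ == "__main__":
    print(exp_from_table(1.23456), math.exp(1.23456))
```

Because the nearer entry is used, the neglected term of the exponential series is at most about $\tfrac12(\tfrac12 \cdot 10^{-3})^2$ in relative size, in line with the estimate above.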

On the whole, we shall simply use the equality sign, where the symbol
$\doteq$ would be proper, and use the latter for emphasis.

Similarly, we can interpolate in trigonometrical tables using the
approximations

$$\sin(x + y) = (1 - \tfrac12 y^2) \sin x + y \cos x$$
$$\cos(x + y) = (1 - \tfrac12 y^2) \cos x - y \sin x$$

when $y$ is small. Checking can be done as in the preceding case, or, if
both functions are needed, we can use

$$\sin^2\theta + \cos^2\theta = 1.$$

A word of warning is, however, necessary in the latter case. This check
is not efficient when $\sin\theta$ is near 0 or 1 (see Stegun and Abramowitz [1]).

We can also handle the inverse trigonometrical functions in a similar
way. Consider arctan x. We have

$$\arctan(x + ph) = \arctan x + \arctan\Delta = \arctan x + \Delta - \tfrac13 \Delta^3 + \cdots, \qquad |\Delta| < 1,$$

where $\Delta = ph/[1 + x(x + ph)]$. The first two terms on the right suffice
to give 6D accuracy when $h = .01$, $|p| < 1$. For instance,

$$\arctan .1234 = \arctan(.12 + .0034) = .119429 + .0034/[1 + .12(.1234)] = .122779$$

and, as a check,

$$\arctan .1234 = \arctan(.13 - .0066) = .129275 - .0066/[1 + .13(.1234)] = .122779.$$


2.4 Recurrence Relations 

Another method of interpolation is the use of recurrence relations. It
is known, for instance, that for $N > 0$ and for suitable values of $x_0$, the
sequence defined for $n = 0, 1, 2, \ldots$ by

$$x_{n+1} = \tfrac12 (x_n + N x_n^{-1}) \qquad (2.1)$$

converges to $N^{1/2}$. The convergence is illustrated graphically in the
case $x_0 = 1$, $N = .25$ in Fig. 2.1.

Fig. 2.1 Quadratic convergence to $\sqrt{N}$.

Arithmetically we observe that

$$x_{n+1} - N^{1/2} = \tfrac12 (x_n - N^{1/2})^2 x_n^{-1},$$

which shows that $0 < x_n$ implies $x_{n+1} \ge N^{1/2}$. Hence, if we choose
$x_0 > 0$, we have $x_n \ge N^{1/2}$ for all $n > 0$. Again, since

$$x_{n+1} - x_n = \tfrac12 (N - x_n^2) x_n^{-1} \le 0,$$

it follows that the sequence $\{x_n\}$ is a monotone decreasing sequence, and
since it is bounded below (e.g., by $N^{1/2}$), it has a limit, say $l$. Clearly,
$l \ge N^{1/2} > 0$. Passage to the limit in (2.1) gives

$$l = \tfrac12 (l + N l^{-1}),$$

so that $l^2 = N$, $l = N^{1/2}$.

The convergence in this case is rapid. If we write $\epsilon_n = x_n - N^{1/2}$,
then $\epsilon_{n+1} = \tfrac12 \epsilon_n^2 x_n^{-1} = O(\epsilon_n^2)$; in this circumstance the convergence
is said to be quadratic, and the number of correct decimal places in $x_n$ is
about doubled at each iteration. See Milne-Thomson [2]. In the
example chosen, the sequence of approximations is


1, .625, .5125, .500152, .500000, ....
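The contrast between quadratic and linear convergence is easy to observe numerically. The sketch below iterates the square-root relation $x_{n+1} = \tfrac12(x_n + N/x_n)$ and, for comparison, the division-free relation $x_{n+1} = x_n + \tfrac12(N - x_n^2)$; the function names are assumptions for illustration, and ordinary floating-point arithmetic stands in for the desk machine.

```python
def quadratic_sqrt_iterates(N, x0, steps):
    """Iterates of x -> (x + N/x)/2, which converge quadratically to sqrt(N)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(0.5 * (x + N / x))
    return xs


def linear_sqrt_iterates(N, x0, steps):
    """Iterates of x -> x + (N - x*x)/2, which converge only linearly."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x + 0.5 * (N - x * x))
    return xs


if __name__ == "__main__":
    print([round(x, 6) for x in quadratic_sqrt_iterates(0.25, 1.0, 4)])
    print(linear_sqrt_iterates(0.25, 0.0, 10))
```

After four steps the first sequence agrees with $\sqrt{.25} = .5$ to six decimals, while the second is still far from it after ten.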
If, instead of (2.1), we take

$$x_{n+1} = x_n + \tfrac12 (N - x_n^2), \qquad (2.1')$$

we can establish that $x_n \uparrow N^{1/2}$, when $0 < N < 1$, $x_0 = 0$. We note
that the convergence in this case is linear:

$$\epsilon_{n+1} = \epsilon_n [1 - \tfrac12 (N^{1/2} + x_n)].$$

The slowness of the convergence in the case $N = .25$, $x_0 = 0$ is evident
from the following sequence of approximations:

0, .125, .242188, ....

For $N > 0$ and for suitable $x_0$, the sequence

$$x_{n+1} = x_n (2 - N x_n) \qquad (2.2)$$

converges quadratically to $N^{-1}$ (see Fig. 2.2). If we take $N = \tfrac12$, $x_0 = 1$,
the successive approximations to $N^{-1} = 2$ are

1, 1.5, 1.875, 1.992188, 1.999969, 2.000000, ....

If, instead of (2.2), we take

$$x_{n+1} = (1 - N) x_n + 1, \qquad (2.2')$$

we can establish linear convergence of $x_n$ to $N^{-1}$; in fact, $\epsilon_{n+1} = (1 - N)\epsilon_n$
(see Fig. 2.3). The corresponding approximations are

1, 1.5, 1.75, 1.875, ....

The relation (2.2) is the special case $p = 1$ of the recurrence relation

$$x_{n+1} = x_n (p + 1 - N x_n^p)/p. \qquad (2.3)$$

For suitable $x_0$, $N$, we have

$$x_n \to N^{-1/p}.$$

These methods are used frequently in automatic computation, and we
shall elaborate a little on them.
We first note that each of the relations (2.1), (2.2), and (2.3) is a special
case of the Newton-Raphson method. A modern account of this process
is given in Chap. 14. Here we note the formula

$$x_{n+1} = x_n - f(x_n)/f'(x_n) \qquad (2.4)$$

which is readily motivated geometrically. For certain $f$ and for suitable
$x_0$, the sequence $x_n$ converges to a zero of $f$. We obtain (2.1), (2.2),
and (2.3) by taking $f(x) = N - x^2$, $f(x) = N - x^{-1}$, $f(x) = N - x^{-p}$,
respectively.

Fig. 2.2 Quadratic convergence to $1/N$ (the curve shown is $y = 2x - Nx^2$).

We next note that there can be more than one recurrence relation,
with a given order of convergence, for a particular function. In addition
to (2.1), we have

$$y_{n+1} = 2y_n^3 (3y_n^2 - N)^{-1} \qquad (2.5)$$

and

$$z_{n+1} = z_n (3N - z_n^2)/2N \qquad (2.6)$$

which converge quadratically to $N^{1/2}$, for suitable initial values. We
can get (2.5) from (2.4) by taking $f(x) = x^3 - Nx$.

The recurrence relation (due to R. Dedekind)

$$x_{n+1} = \frac{x_n^3 + 3N x_n}{3x_n^2 + N} \qquad (2.7)$$

can be shown to have cubic convergence to $N^{1/2}$.
The relation

$$x_{n+1} = x_n [3(1 - N x_n) + (N x_n)^2] \qquad (2.8)$$

has cubic convergence to $N^{-1}$. The study of this is instructive; convergence
takes place when $0 < x_0 < 2N^{-1}$.

In some of the earlier automatic computers there was no division instruction,
and the operation of division had to be programmed as a subroutine;
a convenient method was the use of a relation such as (2.2) or
(2.8). Similarly, the relations (2.1), (2.5), and (2.6) could be used to
produce square roots; for machines without division (2.6) is particularly
convenient. In practice, quadratically convergent sequences are usually
sufficient.

Fig. 2.3 Linear convergence to $1/N$.

Among the other recurrence relations of interest is the arithmetic-geometric
relation of Gauss:

If positive $x_0$, $y_0$ are given, $x_{n+1} = \tfrac12 (x_n + y_n)$, $y_{n+1} = \sqrt{x_n y_n}$, then $\lim x_n$
and $\lim y_n$ exist and are equal. In particular, if $x_0 = 1$, $0 < y_0 < 1$, then

$$\lim x_n = \lim y_n = \pi/[2K'(y_0)]$$

where $K'(k) = \int_0^1 [(1 - x^2)(1 - (1 - k^2)x^2)]^{-1/2}\,dx$.

An elaborate discussion of other relations of this form has been given
by King [3].

There are similar relations which can be used to generate the elementary
transcendental functions (see Hurwitz [4]). Another result, due to
Borchardt, is as follows:

If positive $x_0$, $y_0$ are given, $x_{n+1} = \tfrac12 (x_n + y_n)$, $y_{n+1} = \sqrt{x_{n+1} y_n}$, then $\lim x_n$
and $\lim y_n$ exist and are equal. In particular, if $x_0 = \cos\theta$, $y_0 = 1$, then
$\lim x_n = \lim y_n = (\sin\theta)/\theta$.
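Borchardt's relation can be tried out directly. In this sketch (function name and iteration count are assumptions) the new $x$ is used immediately in forming the new $y$, which is what distinguishes the relation from the arithmetic-geometric one.

```python
import math


def borchardt_limit(x0, y0, iterations=40):
    """Iterate x <- (x + y)/2, y <- sqrt(x_new * y) and return (x, y)."""
    x, y = x0, y0
    for _ in range(iterations):
        x = 0.5 * (x + y)
        y = math.sqrt(x * y)   # uses the freshly updated x
    return x, y


if __name__ == "__main__":
    theta = 1.0
    print(borchardt_limit(math.cos(theta), 1.0), math.sin(theta) / theta)
```

The gap between $x_n$ and $y_n$ shrinks by a factor of roughly 4 per step, so forty iterations are far more than double precision needs.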

We note that the convergence proof given above for the sequence
defined by (2.1) is academic. In practice (whether with desk machines
or automatic computers), we cannot have an infinite descent $x_0 > x_1
> x_2 > \cdots > N^{1/2}$.

Let us take a simple example to show that the relation $x_0 > x_1$ may
be false. The exact method of application of the algorithm and the
precise behavior of the computing equipment must be specified. For
simplicity, we use a machine working to 2 decimal digits, with multiplication
and division rounded by the addition of a 5 to the first digit to
be discarded. We use the relation (2.1) in the form

$$x_1 = .50 x_0 + .50 (N \div x_0).$$

With $N = .01$, $x_0 = .11$, we get $N \div x_0 = .09$ and $.50 \times .11 = .06$;
then $.50 \times .09 = .05$, and

$$x_1 = .05 + .06 = .11 = x_0.$$
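The two-digit machine can be imitated with Python's `decimal` module. This is a sketch under assumptions: results are kept to two decimal places, "adding a 5 to the first digit to be discarded" is modeled by `ROUND_HALF_UP`, and the helper names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")
HALF = Decimal("0.50")


def machine_round(value):
    """Round to 2 decimals the way the text's machine does (half up)."""
    return value.quantize(TWO_PLACES, rounding=ROUND_HALF_UP)


def machine_step(x, N):
    """One step of x <- .50*x + .50*(N / x), rounding every product
    and quotient as it is formed."""
    quotient = machine_round(N / x)
    return machine_round(HALF * x) + machine_round(HALF * quotient)


if __name__ == "__main__":
    print(machine_step(Decimal("0.11"), Decimal("0.01")))
```

Starting from .11 with $N = .01$ the step returns .11 again, so the computed sequence is stationary at a value that is not the rounded square root .10.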

The full examination of (2.1) is a very delicate matter, even in the case
of fixed-point computers. For a thorough discussion, see Householder
[5] or Goldstine et al. [6]. The examination in the case of computers
with floating-point arithmetic becomes more complicated and was
recently carried through by Rumsey [67]. The natural time to stop
would be when the sequence $x_0 > x_1 > \cdots$ becomes stationary or turns
back, and with an efficient algorithm (and machine) one would expect
to obtain at this stage the best possible result. That is to say, if $r$ is the
number alleged by the computer to be $N^{1/2}$, then $r$ is the result of rounding
$N^{1/2}$ to fit the machine.

That care should be taken in the details of the algorithm is shown by
the following example, which we give, for simplicity, in an academic
form. Let us suppose we have decided that the first $x_n$ for which $x_n
\ge x_{n-1}$ is the approximate square root. Then, if $N = -\tfrac12$ and $x_0 = 1$,
we obtain successively

$$1, \quad \tfrac14, \quad -\tfrac78, \quad -\tfrac{17}{112}, \quad \cdots,$$

and we thus obtain

$$\sqrt{-\tfrac12} = -\tfrac{17}{112}.$$
Thus any program for evaluating square roots should include an initial
check that $N \ge 0$. The omission of such a check is not likely to be
serious if one is simply evaluating square roots, for one is not likely to ask 
for the square root of a negative number; but in more realistic problems, 
when one is using the program as a subroutine, to evaluate the square 
root of a calculated quantity which is never displayed, failure to check 
can cause havoc. 
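The point can be illustrated with a small sketch. The stopping rule (return the first iterate that fails to decrease) follows the academic example; the function names, the starting value 1, and the choice to raise `ValueError` are assumptions made for the illustration.

```python
def unchecked_sqrt(N, x0=1.0):
    """Iterate x <- (x + N/x)/2 and return the first x_n with x_n >= x_{n-1}.
    No check is made on the sign of N."""
    previous = x0
    while True:
        current = 0.5 * (previous + N / previous)
        if current >= previous:
            return current
        previous = current


def checked_sqrt(N, x0=1.0):
    """Same iteration, preceded by the initial check on N."""
    if N < 0:
        raise ValueError("square root requested for a negative number")
    if N == 0:
        return 0.0
    return unchecked_sqrt(N, x0)


if __name__ == "__main__":
    print(unchecked_sqrt(-0.5))   # a plausible-looking but meaningless value
    print(checked_sqrt(0.25))
```

The unchecked routine quietly returns $-17/112$ for $N = -\tfrac12$, which is exactly the kind of silent failure that is hard to trace when the routine is buried inside a larger program.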

We note, finally, that the scalar recurrence relations discussed in this 
section can be applied in more general situations, when the real number 
N is replaced by a matrix or an operator. The use of these generaliza- 
tions to find (or to improve) an approximate inverse of a matrix or an 
operator is discussed in Chaps. 6 and 14. 


2.5 Reduced Derivatives 


A very natural way to interpolate is to use Taylor’s series. If

$$f(a + h) = f(a) + h f'(a) + \frac{h^2}{2!} f''(a) + \cdots \qquad (2.9)$$

then, if we know $f(a)$ and its successive derivatives $f'(a), f''(a), \ldots$, it
will be easy to compute $f(a + h)$ for small $h$. For instance, if $f(x) = \sin x$,
we have


x      f        f'       f''       f'''
.4    .3894    .9210    -.3894    -.9210
.5    .4794    .8776    -.4794    -.8776


and we can find sin .4321 as follows:

$$\sin .4321 = \sin(.4 + .0321) = .3894 + (.0321)(.9210) + \tfrac12 (.0321)^2 (-.3894) + \tfrac16 (.0321)^3 (-.9210) + \cdots = .4188.$$

We can check this value by writing sin .4321 = sin (.5 − .0679) and
proceeding as before.
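The calculation and its check can be repeated in a few lines. In this sketch the derivative values are taken from `math.sin` and `math.cos` instead of a 4D table, and the function name is an assumption.

```python
import math


def taylor_sin(a, h, terms=4):
    """Sum the first `terms` terms of the Taylor series of sin about a."""
    # Successive derivatives of sin at a repeat with period 4.
    derivatives = [math.sin(a), math.cos(a), -math.sin(a), -math.cos(a)]
    total = 0.0
    for k in range(terms):
        total += derivatives[k % 4] * h ** k / math.factorial(k)
    return total


if __name__ == "__main__":
    from_below = taylor_sin(0.4, 0.0321)    # sin(.4 + .0321)
    from_above = taylor_sin(0.5, -0.0679)   # sin(.5 - .0679)
    print(round(from_below, 4), round(from_above, 4))
```

Both expansions give .4188 to four decimals; the one from .5 uses the larger increment and is therefore slightly less accurate for the same number of terms.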

In practice, however, it is usual not to tabulate the successive derivatives
$f^{(n)}(x)$ themselves but rather the reduced derivatives $\tau^n = h^n f^{(n)}(x)/n!$,
at an appropriate interval $h$. The table above would now appear as
(with $h = .1$ and the reduced derivatives in units of the fourth decimal)


x      f        τ      τ²     τ³
.4    .3894    921    -19    -2
.5    .4794    878    -24    -1


A table in this form is especially convenient when not only $f$ but also
$f'$ is likely to be needed. For, if we take (2.9) in the form

$$f(a + \theta h) = f(a) + \theta h f'(a) + \frac{\theta^2 h^2}{2!} f''(a) + \cdots$$

and differentiate this, we get

$$h f'(a + \theta h) = h f'(a) + 2\theta \frac{h^2}{2!} f''(a) + 3\theta^2 \frac{h^3}{3!} f'''(a) + \cdots = \tau + 2\theta \tau^2 + 3\theta^2 \tau^3 + \cdots$$


For worked examples, we refer to the introductions to standard tables 
(BAAS 12, NBSAMS 17). 


2.6 Lagrangian Methods 


In the cases discussed so far we have made use of special properties of 
the function f(x) under consideration. We now discuss the Lagrangian 
method, in which no special properties of f(x) are used. The basic idea 
is to obtain a good approximation $L(x)$ for $f(x)$ in terms of simple functions,
in particular polynomials, and to evaluate $f(\xi)$ approximately as
$L(\xi)$. The method is founded on the following result from elementary
algebra:

There is a unique polynomial of degree at most $n$ assuming $n + 1$ arbitrary values $f_i$ at
any $(n + 1)$ distinct points $x_0, x_1, \ldots, x_n$. This polynomial is

$$L_n(x) = L_n(f, x) = \sum_{i=0}^{n} f_i l_i(x)$$

where

$$l_i(x) = l_i^{(n)}(x) = \prod_{j \ne i} (x - x_j)/(x_i - x_j)$$

and the product is over all $j = 0, 1, \ldots, n$; $j \ne i$.

The existence of an interpolating polynomial can be shown using the
nonvanishing of the Vandermondian. For simplicity, let us discuss the
case $n = 3$, the 4-point case. We change the notation and ask for the
existence of a cubic $\alpha x^3 + \beta x^2 + \gamma x + \delta$ which assumes the values $A$, $B$,
$C$, $D$ at distinct points $a$, $b$, $c$, $d$. This assumption requires

$$A = \alpha a^3 + \beta a^2 + \gamma a + \delta$$
$$B = \alpha b^3 + \beta b^2 + \gamma b + \delta$$
$$C = \alpha c^3 + \beta c^2 + \gamma c + \delta$$
$$D = \alpha d^3 + \beta d^2 + \gamma d + \delta.$$

This is a set of linear equations for $\alpha$, $\beta$, $\gamma$, $\delta$ which can be solved uniquely,
since the determinant of the system is a Vandermondian which does not
vanish, $a$, $b$, $c$, $d$ being assumed distinct. This establishes the existence
of an interpolating cubic. That it is unique follows essentially from the
fundamental theorem of algebra, for, if there were two, their difference
would be a polynomial of degree 3 at most, which would vanish at $a$, $b$, $c$, $d$.
It is clear that the $\alpha$, $\beta$, $\gamma$, $\delta$ are linear functions of $A$, $B$, $C$, $D$, and the
same is true for the interpolating cubic $\alpha x^3 + \beta x^2 + \gamma x + \delta$, which we
can therefore write as

$$A\mathcal{A}(x) + B\mathcal{B}(x) + C\mathcal{C}(x) + D\mathcal{D}(x),$$
where the polynomials $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{D}$ are cubics. This shows that tables
of $\mathcal{A}$, $\mathcal{B}$, $\mathcal{C}$, $\mathcal{D}$ would greatly facilitate interpolation in tables of any
function which allows 4-point interpolation. It is clearly not possible to
contemplate tables covering arbitrary $a$, $b$, $c$, $d$, but the problem becomes
practicable if we restrict ourselves to the case when $a - b = b - c =
c - d = h$, particularly if we notice that we can assume that $h = -1$ and
$a = -1$, $b = 0$, $c = 1$, $d = 2$. Using $p$ as the nondimensional variable,
we see that

$$\mathcal{A}(p) = L_{-1}(p) = -p(p - 1)(p - 2)/6$$
$$\mathcal{B}(p) = L_0(p) = (p + 1)(p - 1)(p - 2)/2$$
$$\mathcal{C}(p) = L_1(p) = -(p + 1)p(p - 2)/2$$
$$\mathcal{D}(p) = L_2(p) = (p + 1)p(p - 1)/6.$$


The general case of (n + 1)-point interpolation can be treated simi-
larly. However, the result can be established directly by observing that
l_i(x_j) = δ_ij, so that

\[ L_n(x_j) = \sum_{i=0}^{n} f_i \, \delta_{ij} = f_j, \qquad j = 0, 1, \ldots, n. \]

The uniqueness follows by the argument used above. We note that, for
all x, Σ l_i(x) = 1, since the constant function 1 is its own interpolant.
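The properties just stated are easy to check numerically. The following is a minimal Python sketch, not part of the original text; the function names lagrange_basis and lagrange_interpolate, the nodes, and the test cubic are illustrative assumptions.

```python
from math import prod

def lagrange_basis(nodes, i, x):
    """Value at x of l_i, which is 1 at nodes[i] and 0 at every other node."""
    xi = nodes[i]
    return prod((x - xj) / (xi - xj) for j, xj in enumerate(nodes) if j != i)

def lagrange_interpolate(nodes, values, x):
    """Evaluate L_n(f, x) = sum_i f_i * l_i(x)."""
    return sum(fi * lagrange_basis(nodes, i, x) for i, fi in enumerate(values))

# Hypothetical data: four distinct, deliberately unequally spaced nodes
# and the values of a cubic at those nodes.
nodes = [0.0, 0.5, 2.0, 3.0]
values = [t**3 - 2 * t for t in nodes]

x = 1.3
basis_sum = sum(lagrange_basis(nodes, i, x) for i in range(len(nodes)))
print(basis_sum)                                  # close to 1
print(lagrange_interpolate(nodes, values, x))     # close to x**3 - 2*x
```

Because the data come from a cubic and four nodes are used, the interpolant reproduces the cubic up to rounding, in line with the uniqueness argument.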

The question of the error in interpolation by the Lagrangian formula
is significant only when we are given some information about the general
behavior of the function. The usual remainder formula is

\[ f(x) - L_n(f, x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{i=0}^{n} (x - x_i), \tag{2.10} \]

where f is assumed to have an (n + 1)st derivative in an interval in-
cluding x, x_0, x_1, ..., x_n and where ξ = ξ(x) is a point in this interval.
Note that we do not assume that the nodes x_0, x_1, ..., x_n are equally
spaced. The idea of the proof of this result is fully illustrated by the
linear case, which we shall now discuss.

Linear interpolation between a and b = a + h gives, for f(a + ph),
0 < p < 1,

\[ f(a + ph) \approx f(a) + p[f(b) - f(a)]. \]

We shall show that

\[ f(a + ph) - \{f(a) + p[f(b) - f(a)]\} = \tfrac12 h^2 f''(\xi)\, p(p - 1). \]

To do this we consider

\[ F(p) = f(a + ph) - \{f(a) + p[f(b) - f(a)]\} - Kp(p - 1). \]

We first choose any p_0, p_0 ≠ 0, p_0 ≠ 1, and then choose K = K(p_0) so
that F(p_0) = 0. Then F(p) has three zeros in [0,1]: 0, 1, p_0. Hence,
by two applications of Rolle's theorem, F''(θ) = 0 for some θ, 0 < θ < 1.
This means

\[ h^2 f''(a + \theta h) = 2K, \]

so that

\[ f(a + p_0 h) = f(a) + p_0[f(b) - f(a)] + \tfrac12 h^2 (p_0^2 - p_0) f''(\xi), \]

where ξ = a + θh depends on p_0 through θ. We can now drop the
subscript and write

\[ f(a + ph) - \{f(a) + p[f(b) - f(a)]\} = \tfrac12 f''(\xi)\, h^2 (p^2 - p), \]

which is of the form (2.10).


If we restrict our attention to the case of equally spaced nodes, say
x_i = a + ih, i = 0, 1, ..., n, and consider the error at x = a + ph, then
(2.10) can be written in the convenient form

\[ f(p) - L_n(f, p) = \binom{p}{n+1} h^{n+1} f^{(n+1)}(\xi). \tag{2.10'} \]

We shall now discuss these error estimates briefly and academically;
we return in Sec. 2.10 to a more practical account. In the linear case
we note that for interpolation proper we have 0 < p < 1, and so 0 <
(p − p²) ≤ ¼. Hence

\[ f(x) - L_1(x) = -\tfrac18\, \theta h^2 f''(\xi) \tag{2.11} \]

for some θ, 0 < θ ≤ 1.

This means that the maximum error in linear interpolation in a table
of sin x, at an interval of .02 radian, is ⅛(.02)² = ½ × 10⁻⁴, so that this
method is appropriate for a table to 4D. To see how realistic this esti-
mate is, we consider the evaluation of sin .3367, given

    sin .32 = .314567,    sin .34 = .333487.

Linear interpolation gives .330365, whereas the correct 6D value is
.330374.
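The comparison of the bound with the actual error can be repeated mechanically. This is a small Python sketch under the assumptions of the example (interval .02, the two 6D tabular values quoted above); the helper name linear_interpolate is ours.

```python
import math

def linear_interpolate(a, fa, b, fb, x):
    """Two-point (linear) interpolation between (a, fa) and (b, fb)."""
    p = (x - a) / (b - a)
    return fa + p * (fb - fa)

h = 0.02
bound = h * h / 8      # (1/8) h^2 max|f''|, using |(sin x)''| <= 1

approx = linear_interpolate(0.32, 0.314567, 0.34, 0.333487, 0.3367)
error = math.sin(0.3367) - approx

print(f"bound  {bound:.1e}")
print(f"approx {approx:.6f}")
print(f"error  {error:.1e}")
```

The observed error is several times smaller than the bound, partly because p is not near ½ and partly because |sin x| is only about ⅓ in this part of the table.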

As another application of (2.11), let us determine a range of x for
which linear interpolation in a 6D table of tan x at an interval of .001
radian is appropriate. We have ⅛h²(tan x)'' = ⅛(10⁻⁶) · 2 sin x sec³ x.
As x increases from 0 to ½π, sin x and sec x both increase, and so the
whole error estimate increases. Equating the error estimate to ½ · 10⁻⁶,
we see that sin x sec³ x = 2, which gives x = ¼π. Linear interpolation
is therefore appropriate in the interval 0 ≤ x ≤ ¼π.
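The equation sin x sec³ x = 2 can also be solved numerically, which is how one would proceed for a function without so convenient a root. The following Python sketch uses plain bisection; the function name g and the bracketing interval are assumptions made for the illustration.

```python
import math

def g(x):
    # The error estimate (1/8) h^2 * 2 sin x sec^3 x equals (1/2) 10^-6,
    # with h = .001, exactly when g(x) = 0.
    return math.sin(x) / math.cos(x) ** 3 - 2.0

lo, hi = 0.0, 1.5          # g(lo) < 0 < g(hi), and g increases on [0, pi/2)
for _ in range(60):        # bisection: halve the bracket each step
    mid = 0.5 * (lo + hi)
    if g(mid) < 0:
        lo = mid
    else:
        hi = mid

root = 0.5 * (lo + hi)
print(root, math.pi / 4)
```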

The results which correspond to (2.11) in nonlinear cases can be ob-
tained by examining the behavior of Π(x − x_i) as x varies. For sim-
plicity we consider the 4-point equally spaced case. We have then to
consider the behavior of (p + 1)p(p − 1)(p − 2). This vanishes at
p = −1, 0, 1, 2 and has minima at ½(1 ± √5) and a maximum at ½.
The extreme values are −1, −1, and 9/16. Thus we have, writing M =
max |f^{(4)}(x)|,

\[
\begin{aligned}
|f(x) - L_3(x)| &\le \tfrac{3}{128} h^4 M \quad \text{if } 0 \le x \le h, \\
|f(x) - L_3(x)| &\le \tfrac{1}{24} h^4 M \quad \text{if } -h \le x \le 0,\ h \le x \le 2h.
\end{aligned}
\tag{2.12}
\]

The smaller estimate for the central interval is in agreement with our
intuition.

These mean that the maximum error in 4-point interpolation in a
table of sin x, at an interval of .1 radian, is about 2 × 10⁻⁶ in the central
interval and about twice that in the outer intervals. As an example, let
us find sin .123, given

    sin 0 = 0,   sin .1 = .099833,   sin .2 = .198669,   sin .3 = .295520.

The appropriate coefficients, which can be obtained from tables (or from
the explicit expressions given above), are

    −.0522445,   .8381835,   .2503665,   −.0363055.

We find sin .123 = .122689; the correct value is .122690.
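The coefficients in this example can be generated from the explicit cubic expressions for the equally spaced 4-point case. The Python sketch below does so; the function name four_point_coefficients is an assumption, and p is measured from the node x = .1 in units of the interval.

```python
import math

def four_point_coefficients(p):
    """Lagrangian coefficients for equally spaced nodes at p = -1, 0, 1, 2."""
    return (
        -p * (p - 1) * (p - 2) / 6,
        (p + 1) * (p - 1) * (p - 2) / 2,
        -(p + 1) * p * (p - 2) / 2,
        (p + 1) * p * (p - 1) / 6,
    )

table = [0.0, 0.099833, 0.198669, 0.295520]   # sin 0, sin .1, sin .2, sin .3
p = (0.123 - 0.1) / 0.1                        # p = .23
coeffs = four_point_coefficients(p)
estimate = sum(c * f for c, f in zip(coeffs, table))

print([round(c, 7) for c in coeffs])
print(round(estimate, 6), round(math.sin(0.123), 6))
```

Accumulating sum(coeffs) alongside the products gives the check mentioned later in the comparison of methods: the coefficients must add up to unity.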

For the general results corresponding to (2.11) and (2.12) see the in- 
troduction to NBSCUP 4. 

An elegant and practical method for reducing an (n + 1)-point inter-
polation to a sequence of ½n(n + 1) linear interpolations has been given
by Aitken. It is convenient for both desk calculators and automatic 
computers. We describe the 4-point case for simplicity. 

Given f(a) = A, f(b) = B, f(c) = C, f(d) = D, we show how to find
f(p). Interpolate linearly between (a,A), (b,B) to find (p,B_1); then inter-
polate linearly between (a,A), (c,C) to find (p,C_1) and then between (a,A),
(d,D) to find (p,D_1). The next stage is to interpolate linearly between
(b,B_1), (c,C_1) to find (p,C_2) and then between (b,B_1), (d,D_1) to find (p,D_2).
Finally, interpolate linearly between (c,C_2) and (d,D_2) to find (p,P).

The scheme is illustrated graphically in Fig. 2.4. We note that there
is no assumption that the a, b, c, d are equally spaced, and so the method
can be applied to the case of inverse interpolation.

We give, without comment, two examples which show how the scheme
can be carried out and how labor can be saved by dropping common
initial figures. Many extensions of the method have been given by
Aitken and Neville.

Given f(1) = 1, f(2) = 125, f(3) = 729, f(4) = 2197, find f(2.5).

    1 = a       1 = A
    2 = b     125 = B      187 = B_1
    3 = c     729 = C      547 = C_1     367 = C_2
    4 = d    2197 = D     1099 = D_1     415 = D_2     343 = P

Here f(x) = (4x − 3)³ and the interpolation is exact, as it should be.
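The scheme translates directly into a short program. The following Python sketch, with the assumed function name aitken, overwrites a single column in place: after stage k the entries below row k hold the values written B_1, C_1, D_1, then C_2, D_2, and so on.

```python
def aitken(xs, fs, p):
    """Aitken's scheme of repeated linear interpolation; returns the last entry P."""
    col = list(fs)                       # current column: A, B, C, D, ...
    n = len(xs)
    for k in range(n - 1):               # row k is the pivot for this stage
        for i in range(k + 1, n):
            # linear interpolation between (xs[k], col[k]) and (xs[i], col[i]) at p
            col[i] = col[k] + (p - xs[k]) * (col[i] - col[k]) / (xs[i] - xs[k])
    return col[-1]

print(aitken([1, 2, 3, 4], [1, 125, 729, 2197], 2.5))   # 343.0
```

No equal spacing is assumed, so the same routine serves for inverse interpolation by exchanging the roles of argument and function value.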


[Fig. 2.4 Aitken's algorithm. The horizontal axis is marked a, b, p, c, d.]


Given f(0) = 47.434165, f(1) = 47.539457, f(2) = 47.644517,
f(3) = 47.749346, find f(1.4321).

    0    47.434165                                   −1.4321
    1      .539457    .584954                        − .4321
    2      .644517        788       882              + .5679
    3      .749346        622       882       882    +1.5679

We find f(1.4321) = 47.584882, which agrees with the fact that
f(x) = √(2250 + 10x).


We shall now show generally that the Aitken algorithm leads to the
Lagrangian interpolant. We follow a proof of Feller. We want to
evaluate f(p), where f is a polynomial of degree n determined by its values
at the distinct points x_0, x_1, ..., x_n. Consider

\[ f^{(1)}(x) = \frac{1}{x - x_0} \begin{vmatrix} f(x_0) & x_0 - p \\ f(x) & x - p \end{vmatrix}, \qquad x \ne x_0. \]

(Here the superscript counts stages of the scheme, not derivatives.) The
determinant vanishes at x = x_0, so the division is exact. We observe
that f^{(1)}(x) is a polynomial of degree n − 1 and that f^{(1)}(p)
= f(p). Hence our problem is equivalent to that of evaluating f^{(1)}(p),
where f^{(1)} is determined by its values at x_1, x_2, ..., x_n. Repetition of
this process according to the scheme

    x_0    f(x_0)
    x_1    f(x_1)    f^{(1)}(x_1)
    x_2    f(x_2)    f^{(1)}(x_2)    f^{(2)}(x_2)
    ...
    x_n    f(x_n)    f^{(1)}(x_n)    f^{(2)}(x_n)    ...    f^{(n)}(x_n) = f(p)

leads to the determination of f(p).

It has been necessary so far to insist on the nodes x_0, x_1, ..., x_n being
distinct. It is possible to develop interpolation formulas with coinci-
dent nodes: for instance, if x_0 and x_1 coincide, we consider polynomials
which have not only an assigned value but an assigned derivative at the
double node. The extreme case occurs when all the nodes coincide at
x_0 and we obtain a (finite) Taylor expansion for a polynomial, which,
together with its derivatives, has assigned values at x_0. The most useful
case is the hermitian one, when the nodes are all double; an up-to-date
discussion of various aspects of this method of osculatory interpolation
has been given by Salzer.

A counting argument shows that we may expect to be able to find a
unique polynomial H_{2n+1}(x) of degree 2n + 1 which has assigned values
and assigned derivatives at n + 1 distinct points x_0, x_1, ..., x_n. It is
reasonable to expect that, if the assigned values are f(x_i) and f'(x_i) where
f(x) is differentiable 2n + 2 times, then |H_{2n+1}(f, x) − f(x)| is bounded
by a multiple of max |f^{(2n+2)}(x)|. All this can be established without
the introduction of any new ideas. Indeed, with the earlier notation,

\[ H_{2n+1}(x) = \sum_{i=0}^{n} [1 - 2 l_i'(x_i)(x - x_i)] \{l_i(x)\}^2 f(x_i) + \sum_{i=0}^{n} (x - x_i) \{l_i(x)\}^2 f'(x_i) \]

and

\[ f(x) - H_{2n+1}(x) = f^{(2n+2)}(\xi) \Big[ \prod_{i=0}^{n} (x - x_i) \Big]^2 \Big/ (2n + 2)!, \]

where ξ = ξ(x) lies in the interval containing x, x_0, x_1, ..., x_n.
Examples showing the power of this method can easily be constructed.
It is particularly appropriate in cases where both f and f' are tabulated
alongside each other. For instance, many tables give sin x and cos x,
J_0(x) and J_1(x) = −J_0'(x), and exp(−½x²) and ∫_0^x exp(−½t²) dt to-
gether.
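One such example, offered as an illustrative Python sketch and not taken from the original, is osculatory interpolation of sin x from values of sin x and cos x at three nodes. The function name hermite_interpolate and the choice of nodes are assumptions; l_i'(x_i) is evaluated as the sum of 1/(x_i − x_j) over j ≠ i.

```python
import math
from math import prod

def hermite_interpolate(nodes, values, derivs, x):
    """Evaluate the hermitian interpolant H_{2n+1}(x) from values and first derivatives."""
    total = 0.0
    for i, xi in enumerate(nodes):
        others = [xj for j, xj in enumerate(nodes) if j != i]
        li = prod((x - xj) / (xi - xj) for xj in others)      # l_i(x)
        dli = sum(1.0 / (xi - xj) for xj in others)           # l_i'(x_i)
        total += (1 - 2 * dli * (x - xi)) * li**2 * values[i]
        total += (x - xi) * li**2 * derivs[i]
    return total

nodes = [0.0, 0.5, 1.0]
x = 0.3
approx = hermite_interpolate(
    nodes, [math.sin(t) for t in nodes], [math.cos(t) for t in nodes], x
)
# Remainder bound with n = 2: |f^(6)| <= 1 for the sine, so
# |error| <= [prod (x - x_i)]^2 / 6!
bound = prod((x - t) ** 2 for t in nodes) / math.factorial(6)

print(approx, math.sin(x), bound)
```

With an interval as coarse as .5 the error stays within a few units of the sixth decimal, which is the kind of gain the remainder formula predicts.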

We note that, although the representations of the Lagrangian and
hermitian interpolants are valid when the nodes are complex, the error
estimates given no longer hold. This is essentially due to the fact that
the usual mean-value theorems are not valid in the complex plane.
[For instance, if f(z) = e^z, then f(0) = f(2πi) = 1, but f'(z) = e^z is
never zero.]

Suitable estimates can be easily obtained using Cauchy's theorem
making use of the following representation of the remainder:

If 𝒞 is a rectifiable closed curve, if f(z) is regular in 𝒟, the interior of 𝒞, and
continuous on 𝒞 + 𝒟, and if x and the distinct points a_0, a_1, ..., a_n lie in 𝒟, then

\[ f(x) - L_n(f, x) = \frac{1}{2\pi i} \int_{\mathcal{C}} \frac{f(z) \prod (x - a_i)}{(z - x) \prod (z - a_i)} \, dz. \]

There is a simple modification of this available in the case when f(z) is
allowed to have a finite set of poles in 𝒟.

This representation is useful even when the a_i are real. The presence
of singularities of f(z) off the real axis can affect the efficiency of the
approximation of f(x) by L_n(f, x) on the real axis. This point is elabo-
rated in Chap. 3, where we discuss the Runge example.


2.7 Finite-difference Methods 


The calculus of finite differences is a basic tool in numerical analysis. 
A certain acquaintance with it is convenient in the description and ex- 
planation of the many processes of interpolation which involve differ- 
ences. 

The formation of a table of successive differences of a function
(tabulated at equal intervals in the argument) is indicated below in the
case of a table of cubes.

     x    f(x)    Δf    Δ²f   Δ³f   Δ⁴f
     1      1
                   7
     2      8           12
                  19           6
     3     27           18           0
                  37           6
     4     64           24           0
                  61           6
     5    125           30           0
                  91           6
     6    216           36           0
                 127           6
     7    343           42           0
                 169           6
     8    512           48           0
                 217           6
     9    729           54
                 271
    10   1000

The notation is as follows. If we write
f(n) = f_n = F(a + nh), then we define Δ^r F = Δ_h^r F to be Δ^r f where

\[ \Delta f(n) = f(n + 1) - f(n), \qquad \Delta^r f(n) = \Delta \Delta^{r-1} f(n), \quad r > 1. \]

This gives

\[ \Delta^2 f(n) = \Delta \Delta f(n) = [f(n + 2) - f(n + 1)] - [f(n + 1) - f(n)] = f(n + 2) - 2f(n + 1) + f(n). \]

We find, using the standard notation for binomial coefficients,

\[ \Delta^r f(n) = f(n + r) - \binom{r}{1} f(n + r - 1) + \binom{r}{2} f(n + r - 2) - \cdots + (-1)^{r-1} \binom{r}{r-1} f(n + 1) + (-1)^r f(n). \]
This table illustrates the following fundamental fact: 

The nth differences (at any constant interval) of a polynomial of degree n are 
constant, and all succeeding differences vanish. 

This is readily established by noting that the operation of differencing 
reduces the degree of a polynomial by unity: 


\[ \Delta \Big( \sum_{r=0}^{n} a_r x^r \Big) = \sum_{r=0}^{n} a_r \, \Delta x^r = \sum_{r=0}^{n} a_r [(x + h)^r - x^r]. \]
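Forming a difference table is a purely mechanical process. The following Python sketch (the function name difference_table is ours) builds the successive columns for the table of cubes and exhibits the constant third differences and vanishing fourth differences.

```python
def difference_table(values):
    """Return the columns [f, Δf, Δ²f, ...], each one entry shorter than the last."""
    columns = [list(values)]
    while len(columns[-1]) > 1:
        prev = columns[-1]
        columns.append([b - a for a, b in zip(prev, prev[1:])])
    return columns

cubes = [x**3 for x in range(1, 11)]
cols = difference_table(cubes)

print(cols[1])   # first differences
print(cols[3])   # third differences: constant
print(cols[4])   # fourth differences: zero
```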


It is only rarely in practice that we have to deal with exact poly-
nomials. The effect of rounding off in a difference table is shown by the
following table of .1x³ rounded to the nearest integer.

     x    f(x)    Δf    Δ²f   Δ³f   Δ⁴f
     1      0
                   1
     2      1            1
                   2           0
     3      3            1           2
                   3           2
     4      6            3          −1
                   6           1
     5     12            4          −3
                  10          −2
     6     22            2           5
                  12           3
     7     34            5          −3
                  17           0
     8     51            5           0
                  22           0
     9     73            5
                  27
    10    100

The difference which we expect to be constant is oscillatory (and higher
differences will get larger).

Let us see what happens in a bad case, for example,

\[ f(x) = \tfrac12 - (-1)^x a \]

for a small. The difference tables for f(x) and for f(x) rounded to the
nearest integer are as follows:

     x    f(x)     Δf     Δ²f    Δ³f    Δ⁴f
     0   ½ − a
                   2a
     1   ½ + a            −4a
                  −2a             8a
     2   ½ − a            +4a           −16a
                   2a            −8a
     3   ½ + a            −4a
                  −2a
     4   ½ − a

     x    f(x)     Δf     Δ²f    Δ³f    Δ⁴f
     0     0
                    1
     1     1              −2
                   −1             +4
     2     0              +2            −8
                    1             −4
     3     1              −2
                   −1
     4     0

This shows that we may have a spurious contribution of up to 2^{n−1} units
in the nth difference, because of rounding off of the functional values to
the nearest unit.

The effect of a change in the length of the (constant) interval of
differencing is easily determined: the constant nth difference of x^n at unit
interval is n!, whereas if we take an interval h, the nth difference is h^n n!.

     x    f(x)    Δf    Δ²f   Δ³f   Δ⁴f
     1      1
                   7
     2      8           12
                  19           6
     3     27           18           0
                  37           6
     4     64           24          +1
                  61           7
     5    125           31          −4
                  92           3
     6    217           34          +6
                 126           9
     7    343           43          −4
                 169           5
     8    512           48          +1
                 217           6
     9    729           54           0
                 271           6
    10   1000           60
                 331
    11   1331

The formation of a table of differences is useful in checking a table or
locating errors. The way in which an error propagates is indicated in
the repetition above of the table of cubes, with a deliberate error at
x = 6, where we have written 217 instead of 216. The pattern of
(1 − 1)⁴ in the fourth column is characteristic. In practice, however,
the exact binomial pattern will be more or less obscured by rounding
errors, but it is usually easy to pick out errors, even when several are
present and interfering (see Miller [8]).

The binomial pattern can also be established by noting that differencing
is a linear operation, that the effect of the error is just that of the differences
of 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, and that these come immediately from the
explicit formula for Δ^r.
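The linearity argument can be seen at work in a few lines of Python. In this sketch (the names differences, wrong, and disturbance are ours) the fourth differences of the erroneous table are compared with those of the correct table of cubes; what remains is the binomial pattern with alternating signs.

```python
from math import comb

def differences(values, order):
    """Apply the forward difference operator `order` times."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values

cubes = [x**3 for x in range(1, 12)]
wrong = list(cubes)
wrong[5] += 1                 # 217 entered instead of 216 at x = 6

# Differencing is linear, so the disturbance is the difference of the two tables.
disturbance = [w - c for w, c in zip(differences(wrong, 4), differences(cubes, 4))]
print(disturbance)            # [0, 1, -4, 6, -4, 1, 0]
```

The largest entry of the pattern sits on the line of the erroneous value, which is how the error is located in practice.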

So far we have mainly discussed the differences of polynomials. This 
is almost enough, for we are concerned with polynomial interpolation, 
which is applicable only when the function in question is approximately 
representable by a polynomial. If f(x) is differentiable, we have 


\[ \Delta f(x) = f(x + h) - f(x) = h f'(x + \theta h), \qquad 0 < \theta < 1. \]

Thus, if f' is continuous and h is small, we have

\[ \Delta f(x) \approx h f'(x). \]

More generally, we can show that, if f(x) has a continuous nth derivative
and h is sufficiently small,

\[ \Delta^n f(x) \approx h^n f^{(n)}(x). \]

More precisely, if f^{(n)}(x) is continuous in [a, a + nh], then

\[ \Delta^n f(a) = h^n f^{(n)}(\xi), \qquad a < \xi < a + nh. \]

This can be proved by induction. For n = 1, it is the first mean-
value theorem. Assuming this true for n = r, we consider

\[ \Delta^{r+1} f(a) = \Delta^r [f(a + h) - f(a)]; \]

and our induction hypothesis applied to f(x + h) − f(x) gives

\[ \Delta^{r+1} f(a) = h^r [f^{(r)}(\xi' + h) - f^{(r)}(\xi')], \qquad a \le \xi' \le a + rh. \]

Applying the first mean-value theorem, we now find

\[ \Delta^{r+1} f(a) = h^{r+1} f^{(r+1)}(\xi), \]

where ξ' < ξ < ξ' + h, so that a < ξ < a + (r + 1)h,
as required.
In addition to the forward difference operator A, it is convenient to 
make use of the following operators: ∇, E, μ, δ, D. These are defined
by ∇f(r + 1) = f(r + 1) − f(r), Ef(r) = f(r + 1), μf(r) = ½[f(r − ½)
+ f(r + ½)], δf(r) = f(r + ½) − f(r − ½), Df(r) = f'(r). These
operators can be manipulated with reasonable impunity. Among the
interrelations between them are

\[ E = 1 + \Delta, \qquad E = \exp hD, \qquad \delta = E^{1/2} - E^{-1/2} = 2 \sinh \tfrac12 hD. \]


It is convenient to indicate the entries in a difference table in the for- 
ward, central, and backward notations: 


    f(−2) = f(a − 2h)

    f(−1) = f(a − h)    μδ_{−1}                     Δ²_{−2} = ∇²_0 = δ²_{−1}
                        Δ_{−1} = ∇_0 = δ_{−1/2}     μδ²_{−1/2}                  Δ³_{−2} = ∇³_1 = δ³_{−1/2}
    f(0) = f(a)         μδ_0                        Δ²_{−1} = ∇²_1 = δ²_0       μδ³_0                         Δ⁴_{−2} = ∇⁴_2 = δ⁴_0
                        Δ_0 = ∇_1 = δ_{1/2}         μδ²_{1/2}                   Δ³_{−1} = ∇³_2 = δ³_{1/2}
    f(1) = f(a + h)     μδ_1                        Δ²_0 = ∇²_2 = δ²_1
                        Δ_1 = ∇_2 = δ_{3/2}

    f(2) = f(a + 2h)

Little useful purpose is served by explicit evaluations of differences of 
functions. The following result, easily established by induction, is, how- 
ever, interesting: 

\[ \delta^{2n} [\sin(a + ph)] = (-4 \sin^2 \tfrac12 h)^n \sin(a + ph). \]

This shows that, for h > ⅓π, δ^{2n} need not converge as n → ∞.
We discuss briefly the Newton-Gregory process of interpolation. For
positive integral p, we have

\[ E^p = (1 + \Delta)^p, \]

that is,

\[ f(p) = f(0) + \binom{p}{1} \Delta f(0) + \cdots + \binom{p}{p-1} \Delta^{p-1} f(0) + \Delta^p f(0). \]

The last relation is an algebraic identity, no matter what f is. If we
allow p to have a general value, we obtain the Newton-Gregory formula:

\[ f(p) = f(0) + p\,\Delta f(0) + \tfrac12 p(p - 1) \Delta^2 f(0) + \tfrac16 p(p - 1)(p - 2) \Delta^3 f(0) + \cdots. \]

When f is a polynomial, the differences become zero, the series is a finite
one, and we have an algebraic identity. In general, we can obtain an
expression for the remainder in the series.


Observe that

\[ S_n(p) = f(0) + \binom{p}{1} \Delta f(0) + \cdots + \binom{p}{n-1} \Delta^{n-1} f(0) \]

is a polynomial in p of degree n − 1. It is almost evident that S_n(p) = f(p)
for p = 0, 1, 2, ..., n − 1. Hence S_n(p) = L_{n−1}(f, p), the Lagrangian
interpolant being unique. We can therefore use the remainder already
obtained in (2.10'):

\[ f(p) - S_n(p) = \binom{p}{n} h^n f^{(n)}(\xi), \]

provided f^{(n)} exists, where ξ lies in the interval containing p, 0, 1, ...,
n − 1.

It is now clear that, with a table of a function and its differences and 
a table of binomial coefficients, we are in a position to interpolate. Al- 
though this is not a preferred method, we shall discuss an example in 
some detail, since it illustrates several general points. 

We shall evaluate f(3.927) from the accompanying table [which gives
f(x) = e^x − 50]. We require the binomial coefficients for argument
.35 = .007/.02. These can be computed or obtained from tables and
are

    1,   .35,   −.11375,   .06256,   −.04145.

      x       f(x)        Δ        Δ²     Δ³    Δ⁴
    3.92     .40044
                       101816
    3.94    1.41860              2057
                       103873            40
    3.96    2.45733              2097            5
                       105970            45
    3.98    3.51703              2142           −3
                       108112            42
    4.00    4.59815              2184
                       110296
    4.02    5.70111

We find

    f(3.927) = .40044 + 1.01816 × .35 + .02057 × (−.11375)
               + .00040 × .06256
             = .40044 + .35636 − .00234 + .00003
             = .75449.                                        (2.13)

For a check on this, we can think of 3.927 as 3.94 − .65 × .02 instead
of as 3.92 + .35 × .02, obtain the binomial expansions for argument
−.65, and carry on as before.
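The computation leading to (2.13) can be sketched in Python as follows. The function names leading_differences and newton_gregory are assumptions; the binomial coefficients are built up by the recurrence C(p, k + 1) = C(p, k)(p − k)/(k + 1). Since the program does not round each term to 5D as the hand computation does, its result may differ from (2.13) by a unit in the last place.

```python
import math

# e^x - 50 at x = 3.92(.02)4.02, as tabulated
table = [0.40044, 1.41860, 2.45733, 3.51703, 4.59815, 5.70111]

def leading_differences(values):
    """f(0), Δf(0), Δ²f(0), ... taken along the top diagonal of the difference table."""
    out = []
    while values:
        out.append(values[0])
        values = [b - a for a, b in zip(values, values[1:])]
    return out

def newton_gregory(values, p, terms):
    """Sum the first `terms` terms of the Newton-Gregory series at argument p."""
    diffs = leading_differences(values)
    total, coeff = 0.0, 1.0          # coeff runs through C(p, 0), C(p, 1), ...
    for k in range(terms):
        total += coeff * diffs[k]
        coeff *= (p - k) / (k + 1)
    return total

p = (3.927 - 3.92) / 0.02            # p = .35
result = newton_gregory(table, p, 4)
print(result, math.exp(3.927) - 50)
```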

We note that the behavior of successive terms in (2.13) provides a
usually reliable way of deciding on the accuracy of a particular inter-
polation without recourse to the error estimate which is usually taken in
the form

\[ \binom{p}{n} \Delta^n. \]

In the present case it would appear that the contri-
bution from the fourth differences does not affect the fifth decimal place.
We observe that, since f^{(4)} = f + 50 ≈ 50, we can expect a fourth
difference of about 50 × (.02)⁴ = .8 × 10⁻⁵, that is, about a unit in the
fifth place. We actually found 5, −3. The discrepancies are 4, −4,
about one-half the extreme variation possible. It is interesting to check
that these are due to the rounding of the tabular values.

We have seen that the Newton-Gregory process is obtained by a mere 
algebraic rearrangement of the Lagrangian interpolant. Other re- 
arrangements associated with Bessel and Everett which are preferred 
above the Newton-Gregory one will now be discussed. They can also 
be generated formally by use of the difference operators. 

The Bessel formula is 


\[ f(p) = f(0) + p\,\delta_{1/2} + B_2(\delta_0^2 + \delta_1^2) + B_3\,\delta_{1/2}^3 + B_4(\delta_0^4 + \delta_1^4) + \cdots \]

where the coefficients B_r are

\[
\begin{aligned}
B_2(p) &= B_2(1 - p) = \tfrac12\, p(p - 1)/2!, \\
B_3(p) &= -B_3(1 - p) = p(p - 1)(p - \tfrac12)/3!, \\
B_4(p) &= B_4(1 - p) = \tfrac12\, (p + 1)p(p - 1)(p - 2)/4!, \ \ldots.
\end{aligned}
\]

[It would be more reasonable to write the first two terms in the form

\[ \tfrac12 [f(0) + f(1)] + (p - \tfrac12)\,\delta_{1/2}.] \]

The Bessel formula is particularly convenient when p = ½, for then the
odd coefficients vanish. For example, if we neglect the fourth and
higher differences, we obtain

\[ f(\tfrac12) = \tfrac12 f(0) + \tfrac12 f(1) + (-\tfrac{1}{16}) [\delta^2 f(0) + \delta^2 f(1)], \]

which can be written in Lagrangian form as

\[ f(\tfrac12) = \tfrac{1}{16} [-f(-1) + 9f(0) + 9f(1) - f(2)]. \]
Perhaps the most useful process is the Everett one:

\[ f(p) = \{(1 - p) f(0) + p f(1)\} + \{E_2 \delta_0^2 + F_2 \delta_1^2\} + \{E_4 \delta_0^4 + F_4 \delta_1^4\} + \cdots \tag{2.14} \]

where

\[
\begin{aligned}
E_2(p) &= F_2(1 - p) = -p(p - 1)(p - 2)/3!, \\
E_4(p) &= F_4(1 - p) = -(p + 1)p(p - 1)(p - 2)(p - 3)/5!.
\end{aligned}
\]

Use of the terms within the first pair of braces is equivalent to linear (i.e.,
2-point) interpolation; use of the terms within the first two pairs of braces
is equivalent to cubic (i.e., 4-point) interpolation; and so on. It follows,
from the form of the remainder in the Lagrangian 4-point case, that we
can interpolate in a table of sin x at an interval of .2 and have an error
of less than ½ × 10⁻⁴ if we use the first four terms in (2.14).

We repeat an example from Chap. 1. To find sin 1.234, we use the
entries:

      x     sin x      δ²
     1.2    .9320    −371
     1.4    .9854    −392

The Everett coefficients for p = .17 are E_2 = −.0430, F_2 = −.0275, and
we find

    sin 1.234 = (.83 × .9320 + .17 × .9854)
                + (.0430 × .0371 + .0275 × .0392)
              = .9438.

Owing to the symmetry of the Everett formula, it is not possible to check
the computation by using 1.234 = 1.4 − .83(.2). It is quite easy to
misread the coefficients, or to give them a wrong sign, and so some
checking is desirable. We can use the Bessel method. The extra
difference (−21) which is required can be obtained mentally. The
Bessel coefficients for p = .17 are B_2 = −.0353, B_3 = .0078, and we find

    sin 1.234 = (.83 × .9320 + .17 × .9854) + (−.0353)(−.0763)
                + .0078(−.0021)
              = .9438.
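The Everett computation and its Bessel check can be set side by side in a short Python sketch. The function names everett2 and bessel3 are ours; both formulas are truncated as in the example (Everett after the second differences, Bessel after the third), and in that form they are algebraic rearrangements of the same cubic.

```python
def everett2(f0, f1, d2_0, d2_1, p):
    """Everett's formula through second differences."""
    q = 1 - p
    e2 = -p * (p - 1) * (p - 2) / 6       # E_2(p)
    f2 = -q * (q - 1) * (q - 2) / 6       # F_2(p) = E_2(1 - p)
    return (q * f0 + p * f1) + (e2 * d2_0 + f2 * d2_1)

def bessel3(f0, f1, d2_0, d2_1, p):
    """Bessel's formula through the third difference."""
    b2 = p * (p - 1) / 4                  # (1/2) p(p - 1)/2!
    b3 = p * (p - 1) * (p - 0.5) / 6      # p(p - 1)(p - 1/2)/3!
    d3 = d2_1 - d2_0                      # third difference at 1/2
    return f0 + p * (f1 - f0) + b2 * (d2_0 + d2_1) + b3 * d3

p = (1.234 - 1.2) / 0.2                   # p = .17
ev = everett2(0.9320, 0.9854, -0.0371, -0.0392, p)
be = bessel3(0.9320, 0.9854, -0.0371, -0.0392, p)
print(round(ev, 4), round(be, 4))
```

Agreement of the two results checks the arithmetic and the signs of the coefficients, but not the tabular entries themselves, since both routes use the same data.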


We shall now discuss the "throwback" (see Chap. 1 and NPL 1). It
was observed that the ratios

\[ \frac{E_4}{E_2} = \frac{p^2 - 2p - 3}{20} \qquad \text{and} \qquad \frac{B_4}{B_2} = \frac{p^2 - p - 2}{12} \]

are approximately constant for 0 ≤ p ≤ 1. (The reader should draw a
rough graph of the two quadratics.) If, therefore, a reasonable mean
value k of these ratios is chosen, we can include the fourth-difference
contribution by "modifying" the second difference:

\[ \delta_m^2 f = \delta^2 f + k\,\delta^4 f. \]

Various ways of choosing k have been discussed, and the preferred value
is k = −.18393 (see, e.g., Abramowitz [9]).

The use, in practice, of this idea is along the following lines. The
tablemaker computes f, δ²f, δ⁴f but prints only f, δ_m²f. If the user treats
the δ_m²f just as he treats δ²f, he receives a bonus, the major part of the
contribution of δ⁴f, without having to obtain the E_4, F_4 or to do the multi-
plications and additions. The actual error incurred by the modification
is less than half a unit if the fourth differences are less than 1000 and the
fifth difference less than 70.

If we return to the case of a 4D table of sin x, we now see that an in-
terval of h = .5 will practically suffice. That is to say, the table

      x     sin x     δ_m²
      0     .0000        0
     .5     .4794    −1225
    1.0     .8415    −2154
    1.5     .9975    −2552
    2.0     .9093    −2326

will be adequate. For instance, since for

    p = ½,   E_2 = −.0625,   F_2 = −.0625,

we have

    sin 1.25 = .9195 + (1/16)(.2154 + .2552) = .9489.

Actually,

    sin 1.25 = .9490,

and the discrepancy can be explained by round-off or by the marginal
choice of h.
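The modified differences in this table can be regenerated from the 4D values themselves. The Python sketch below is an illustration under stated assumptions: the values of sin x at −.5 and 2.5, which lie outside the printed table, are supplied so that fourth differences exist at every printed argument, and the name modified_second_difference is ours.

```python
import math

K = -0.18393                                   # throwback constant quoted in the text
h = 0.5
xs = [-0.5 + h * i for i in range(7)]          # -0.5, 0, 0.5, ..., 2.5
f = [round(math.sin(x), 4) for x in xs]        # 4D tabular values

def modified_second_difference(i):
    """δ_m² f = δ² f + K δ⁴ f at the tabular point xs[i]."""
    d2 = f[i + 1] - 2 * f[i] + f[i - 1]
    d4 = f[i + 2] - 4 * f[i + 1] + 6 * f[i] - 4 * f[i - 1] + f[i - 2]
    return d2 + K * d4

dm_10 = modified_second_difference(3)          # at x = 1.0
dm_15 = modified_second_difference(4)          # at x = 1.5

p = 0.5
e2 = -p * (p - 1) * (p - 2) / 6                # E_2 = F_2 = -1/16 at p = 1/2
estimate = 0.5 * (f[3] + f[4]) + e2 * (dm_10 + dm_15)
print(round(dm_10, 4), round(dm_15, 4), round(estimate, 4))
```

The user of the table never sees the fourth differences; they enter only through the printed δ_m² column.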


2.8 Comparison of Methods 


There is a violent transatlantic controversy about the methods dis- 
cussed in Secs. 2.6 and 2.7. On the whole, the Americans favor the 
Lagrangian methods and the Europeans the difference methods. We 
shall mention some of the arguments brought up. 

There is no doubt that the Lagrangian method should be used only for 
a guaranteed table; the calculation of the differences required for the 
other methods provides a check on the reliability of a new or doubtful 
table and, the differences having been obtained, the use of the difference 
methods is usually less laborious. There is, however, the question of 
what order of interpolation to use. Only in a few tables is this given, 
and so, if one does not want to compute differences, one is forced to
carry out more than one interpolation and to observe the behavior of the 
interpolant; for this the Aitken scheme is efficient. Since the sum of the 
Lagrangian coefficients is necessarily unity, one can provide a good 
check in desk computations by accumulating the multipliers as well as 
the product. This check is not available in the finite-difference methods 
(except in the linear case), where the varying orders of magnitude of the 
multipliers and the differences encourage wrong settings in the calculator.
We have already pointed out that it is possible to check the computations 
in the Bessel case by doing the interpolation for p and for 1 − p. An-
other advantage of the Bessel method is that one can use the exact order
which is necessary; the Everett method always gives an even order
interpolation. Neither of these methods is available at the beginning or
end of a table, unless the missing differences are supplied (e.g., by guess- 
ing or by using properties of the function, e.g., parity); the Newton- 
Gregory method is suitable for the beginning of a table and is easily 
modified to handle the end of a table. 


2.9 Auxiliary Functions 


It has been clear from examples and from the form of the remainders 
that interpolation becomes awkward in the neighborhood of a singularity
of a function or of one of its derivatives. One of the ways of easing the 
situation is by the introduction of auxiliary functions, which may be 
additive or multiplicative. We choose functions a(x) or m(x), which are 
already tabulated (or easily computed) and simpler (in the sense of per- 
mitting easier interpolation), in such a way that a(x) + f(x), or m(x)f(x), 
is "smoother" than f(x) itself.

We illustrate this by two simple examples. Consider cosec x near
x = 0. The difficulty in interpolation is evident from the accompanying
table for cosec x:

      x     cosec x    cosec x − x⁻¹     Δ     Δ²     x cosec x     Δ     Δ²    Δ³
      0        ∞         0                            1.000000
                                        833                        4
    .005    200.001      .000833                1     1.000004            9
                                        834                       13           −1
    .010    100.002      .001667               −1     1.000017            8
                                        833                       21            0
    .015     66.6692     .002500                0     1.000038            8
                                        833                       29            0
    .020     50.0033     .003333                1     1.000067            8
                                        834                       37            1
    .025     40.0042     .004167                0     1.000104            9
                                        834                       46
    .030     33.3383     .005001                      1.000150

However, interpolation in the tables of cosec x − x⁻¹ or x cosec x is trivial,
and we can obtain the value of cosec x by addition (after reference to a
table of reciprocals) or by a division. The choice of auxiliary functions
is usually easy. In the present case they were suggested by the power
series

\[ x \operatorname{cosec} x = 1 + \tfrac16 x^2 + \tfrac{7}{360} x^4 + \cdots, \]

which is convergent for |x| < π.
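The gain from the multiplicative auxiliary function can be measured directly. In the Python sketch below (the argument .012 and the variable names are chosen for illustration) linear interpolation in cosec x itself is compared with linear interpolation in x cosec x followed by a division.

```python
import math

def cosec(x):
    return 1.0 / math.sin(x)

a, b, x = 0.010, 0.015, 0.012
p = (x - a) / (b - a)

# Linear interpolation in cosec x itself
direct = cosec(a) + p * (cosec(b) - cosec(a))

# Linear interpolation in the smooth function x cosec x, then division by x
smooth = a * cosec(a) + p * (b * cosec(b) - a * cosec(a))
via_auxiliary = smooth / x

exact = cosec(x)
print(direct - exact)          # error of several units
print(via_auxiliary - exact)   # error of about 1e-4
```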
As a slightly more sophisticated example, we discuss the case of
y = arcsin x near x = 1. The choice of an auxiliary function is moti-
vated as follows. We have

\[ \arcsin(1 - \varepsilon) = \tfrac12\pi - \arccos(1 - \varepsilon) = \tfrac12\pi - \arcsin(2\varepsilon - \varepsilon^2)^{1/2} = \tfrac12\pi - \sqrt{2\varepsilon} + O(\varepsilon^{3/2}). \]

Hence

\[ \arcsin x + \sqrt{2(1 - x)} = \tfrac12\pi + O[(1 - x)^{3/2}]. \]

      x     y = arcsin x     δ²y     z = y + √(2(1 − x))     Δz
     .94       1.2226                      1.5690
                                                              4
     .95       1.2532          32          1.5694
                                                              4
     .96       1.2870          44          1.5698
                                                              4
     .97       1.3252          71          1.5702
                                                              3
     .98       1.3705         135          1.5705
                                                              2
     .99       1.4293         827          1.5707
                                                              1
    1.00       1.5708                      1.5708

To find, for example, arcsin .9512 from the original table, we have to use at
least cubic interpolation, but with the new table we can use linear inter-
polation and a square-root table to get

    arcsin .9512 = 1.5694 + .12 × .0004 − √(2 × .0488)
                 = 1.2570.

In the case of a function which oscillates with a slowly varying ampli- 
tude and period, it is appropriate to look for auxiliary functions A(x), 
B(x) such that 

\[ F(x) = A(x) \cos [B(x) + \varepsilon]. \]

The appropriate choice is often suggested by asymptotic representation.
For instance, the asymptotic expansion for J_0(x) suggests that we repre-
sent it in the form

\[ J_0(x) = A_0(x) \sin x + B_0(x) \cos x. \]

The efficiency of this is evident on referring to BAAS 6, 10. A straight-
forward table for the range 0 to 25 occupies 200 pages, whereas a com-
parable 8D table giving A_0(x), B_0(x) for the range 25 to 6000 requires 10
pages. 

It is appropriate to mention here the use of rational (and, in particular, 
polynomial) approximations to functions. Given such an approxima-
tion, interpolation is replaced by the calculation of two polynomials (see
Chap. 1). The use of such an approximation has increased with the use
of automatic computers, and there is considerable activity in connection 
with the automatic generation of such approximations. For a severely 
practical account of these matters, we refer to Hastings [11]. See also 
NPL 6 and Chap. 3. 

It is also appropriate to indicate here how changes in the dependent 
or independent variable or in both can facilitate tabulation and inter- 
polation. These changes are often suggested by asymptotic expansions. 

For instance, consider the tabulation of Γ(x) for x > 100. From the
asymptotic expansion it is apparent that

\[ f_1(x) = \frac{\Gamma(x)}{e^{-x} x^{x - 1/2} (2\pi)^{1/2}} \]

is approximately linear in x⁻¹. A table of f_1(x) for x⁻¹ = 0(.001).01, for
example, together with appropriate auxiliary tables, enables Γ(x) to
be obtained throughout the range to about 8D.

Another example is the tabulation of Ei(x) = ∫_{−∞}^{x} t⁻¹ e^t dt for 10 ≤
|x| ≤ ∞. It has been shown that a table of T(x) = e^{−x} Ei(x) − x⁻¹ for
x⁻¹ = −.1(.01).1 is interpolable to about 8D, using the Everett process,
with modified second differences (see Fox and Miller [12]).

Finally, consider the tabulation of J_0(x) for large x. We have

\[ J_0(x) \sim (2/\pi x)^{1/2} [P_0(x) \cos(x - \tfrac14\pi) - Q_0(x) \sin(x - \tfrac14\pi)], \]

where P_0(x) and xQ_0(x) are power series in x⁻². This suggests the use of
x⁻² as an independent variable; it has been shown by Hartree [13] that
this change is very efficient and that a table of 41 entries, in which linear
interpolation is good to 7D, covers the range 5 ≤ x ≤ ∞.

For further examples see NPL 4 and RSS 3.


2.10 Errors in Interpolation 


We shall discuss here the errors which occur when we compute f(p)
according to the methods we have described.

In the first place, the argument p may be uncertain, and if the amount
of uncertainty is ε, the consequential uncertainty in f(p) will be εf'(p).
If f is given analytically, it may be possible to estimate f'(p); otherwise
we can estimate it as h⁻¹Δf. For instance, if f(x) = sin x, the uncer-
tainty in f(p) does not exceed that in p; if f(x) = arctan x, the same is
true for all p, but for large p the uncertainty in f(p) is much smaller.

Should this type of uncertainty be present, its effect must be examined 
first, in order to ensure that an appropriate interpolation method is 
chosen. It is manifestly uneconomical and misleading to use an inter-
polation process in which the intrinsic errors are much less than the un-
avoidable uncertainties.

We have given already an admittedly academic error estimate for the
Lagrangian process and therefore for the equivalent difference methods.
We now examine an interpolation process more carefully; for simplicity,
we discuss linear interpolation in a d-decimal-place table. We want a
value of f(p), being given f(a), f(b). Now, we are not usually given f(a),
f(b), but rather approximations to these, f̄(a), f̄(b), which are obtained,
for example, by rounding the true values. The difference between

\[ \bar f(a) + \frac{p - a}{b - a} [\bar f(b) - \bar f(a)] \qquad \text{and} \qquad f(a) + \frac{p - a}{b - a} [f(b) - f(a)] = L_1(p) \]

is

\[ \frac{p - a}{b - a} [\bar f(b) - f(b)] + \frac{b - p}{b - a} [\bar f(a) - f(a)]. \]

Let us assume that the f̄ are obtained by rounding the f, so that the tabular
error is at most ε = ½ · 10^{−d}. Then, for a ≤ p ≤ b, the above difference
is at most

\[ \frac{p - a}{b - a}\,\varepsilon + \frac{b - p}{b - a}\,\varepsilon = \varepsilon. \]

However, we do not actually compute f̄(a) + [(p − a)/(b − a)][f̄(b) − f̄(a)], for
(p − a)/(b − a) will not be integral in general, and we shall have to incur
a rounding error of amount at most ε. Thus the difference between the
result of our computation and L_1(p) is in absolute value at most 2ε. In
order to find the actual discrepancy, we must add the truncation error;
if, as is usual, we arrange for this to be at most ε, the total absolute error
will be at most 3ε (see Ostrowski [14]).

As an example, let us find sin .43 by linear interpolation between
sin .42 = .4078, sin .44 = .4259. We find $.4078 + \tfrac12 \times .0181$, which
rounds to .4168, whereas sin .43 = .4169. The values of the data to 6
places are .407760, .425939, and sin .43 = .416871.
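The sin .43 example can be checked mechanically. The following is a minimal sketch, not part of the original text; the helper name `linear_interp` and the use of Python's `round` to mimic a 4-decimal table are assumptions made for illustration. It confirms that the observed error stays within the $3\epsilon$ bound derived above.

```python
import math

def linear_interp(a, fa, b, fb, p):
    """Linear interpolant L1(p) through (a, fa) and (b, fb). Hypothetical helper."""
    return fa + (p - a) / (b - a) * (fb - fa)

d = 4                              # decimal places carried by the table
eps = 0.5 * 10 ** (-d)             # maximum tabular (rounding) error
fa = round(math.sin(0.42), d)      # tabulated sin .42
fb = round(math.sin(0.44), d)      # tabulated sin .44

approx = linear_interp(0.42, fa, 0.44, fb, 0.43)
error = abs(approx - math.sin(0.43))
print(f"interpolated {approx:.5f}, true {math.sin(0.43):.6f}, "
      f"error {error:.1e}, bound 3*eps = {3 * eps:.1e}")
```

The bound $3\epsilon$ is a worst case; in this instance the actual error is considerably smaller.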

Let us now consider what happens in the nonlinear case. Essentially 
similar considerations apply in the Lagrangian and difference methods. 
Assuming p to be given precisely, we have to obtain the Lagrangian or 
Bessel or Everett coefficients. In general, these will not be obtained 
exactly—we may, indeed, have to obtain them by interpolation. Then 
there are the errors due to tabular errors in the Lagrangian case and due 
to these tabular errors and to consequential errors in the differences used 
in the other cases. 

If modified differences are employed, we have to take account both of
the errors caused in their calculation and of the additional truncation
errors caused by the modification.

It is clear, therefore, that to make full error estimates in an interpolation
process is a laborious job. The results of such estimates are available
in the booklet IAT, to which we have referred.


2.11 Inverse Interpolation and Related Topics 


It is not often that one has to interpolate in a table given at unequal
intervals. If much interpolation in such a table is required, it is probably
best to produce from it an interpolable table at a constant interval. This
can be done by using the original Lagrangian method, although it will
now not be possible to obtain the coefficients from tables; they will have
to be calculated. The Aitken process is, of course, very suitable here.

The concept of divided differences is used in these circumstances.
Suppose $f(x)$ given at $x_0, x_1, x_2, \ldots$. The first divided difference is
defined by

$$f(x_i, x_{i+1}) = [f(x_{i+1}) - f(x_i)]/(x_{i+1} - x_i).$$

The second divided difference is defined by

$$f(x_i, x_{i+1}, x_{i+2}) = [f(x_{i+1}, x_{i+2}) - f(x_i, x_{i+1})]/(x_{i+2} - x_i).$$

And similarly for higher divided differences. It can be shown that the
$n$th divided differences of a polynomial of degree at most $(n - 1)$ vanish.
Thus the equality of the $n$th divided differences is a test for a function to
coincide with a polynomial of degree at most $n$. The approximate
equality will indicate that an $(n + 1)$-point interpolation will be appropriate.

We note that, when $x_r = x_0 + rh$, we have

$$f(x_0, x_1, \ldots, x_n) = h^{-n}\Delta^n f(x_0)/n!.$$


We shall discuss the calculation of $J_{1/2}(1)$ given $J_0(1)$, $J_{1/4}(1)$, $J_{1/3}(1)$,
$J_{2/3}(1)$, $J_{3/4}(1)$, $J_1(1)$. We begin by forming a table of divided differences:

    12ν    J_ν(1)
     0     .7652
                     -43
     3     .7522            -42
                    -213             2
     4     .7309            -24
                    -332             2
     8     .5979            -12
                    -392             1
     9     .5587             -1
                    -395
    12     .4401

We carry out, by the Aitken process, a 5-point interpolation:

    0    .7652
    3    .7522    .7392
    4    .7309    .7138    .6630
    8    .5979    .6397    .6795    .6712
    9    .5587    .6274    .6833    .6711    .6714

Actually,

$$J_{1/2}(1) = .6713967\ldots$$
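The Aitken scheme above can be sketched in a few lines. This is an illustrative implementation, not the author's; the function name `aitken` is hypothetical, and intermediate values are carried in floating point rather than rounded to four places as in the table, so the columns agree with the printed ones only to rounding.

```python
def aitken(xs, fs, x):
    """Value at x of the polynomial through (xs[i], fs[i]), by Aitken's process.

    Column k combines the pivot entry p[k-1] with each later entry p[i]
    by linear cross-multiplication, exactly as in the printed scheme.
    """
    p = list(fs)
    n = len(xs)
    for k in range(1, n):
        for i in range(k, n):
            p[i] = (p[k - 1] * (xs[i] - x) - p[i] * (xs[k - 1] - x)) / (xs[i] - xs[k - 1])
    return p[-1]

nodes = [0, 3, 4, 8, 9]                          # the argument 12 * nu
values = [0.7652, 0.7522, 0.7309, 0.5979, 0.5587]
estimate = aitken(nodes, values, 6)              # 12 * nu = 6, i.e. nu = 1/2
print(f"5-point Aitken estimate of J_1/2(1): {estimate:.4f}")
```

Because the nodes need not be equally spaced, the same routine serves for any table given at unequal intervals.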


The problem of inverse interpolation is to solve for $x$ the equation

$$f(x) = \alpha,$$

$\alpha$ and a table of $f(x)$ being given; we are evaluating, in fact, $f^{-1}(\alpha)$. This
is trivial in case the table permits linear interpolation, for then $x = a + \theta h$, where

$$\alpha = f(a + \theta h) = f(a) + \theta[f(a + h) - f(a)],$$

so that

$$\theta = \frac{\alpha - f(a)}{f(a + h) - f(a)}.$$


If linear interpolation is not permissible, a convenient method is to 
subtabulate the table (of course, only in the neighborhood of the required 
x) until a table which is interpolable linearly is obtained. This sub- 
tabulation can be done by any of the methods described; if much sub- 
tabulation is to be done, it is worthwhile making use of special methods 
(see [15]). We consider the determination of the zero $p$ of $f(x)$, where

$$f(0) = 9480931, \quad f(1) = 4286597, \quad f(2) = -905580, \quad f(3) = -6095598.$$

We find that the second differences are large (2157, 2159) but that the
third is negligible. We estimate the position of the zero by linear interpolation as

$$p = 1 + \theta, \qquad \theta = \frac{4286597}{5192177},$$

so that $p \approx 1.8255$.

We compute by 3-point Lagrangian interpolation:

$$f(1.820) = 28853, \quad f(1.825) = 2895, \quad f(1.830) = -23062.$$

The second difference is now $-1$, and so linear inverse interpolation is
permissible. We find

$$p = 1.825 + \frac{2895}{25957} \times .005 = 1.82555766.$$
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The same example can be carried out by the Aitken process, which 
can be presented as follows:

        f(x)   x
     9480931   0
     4286597   1   1.82524478                               5194334
     -905580   2       562383   1.82555771                 10386511    5192177
    -6095598   3       600328          796   1.82555766    15576529   10382195   5190018


The numbers on the right are the divisors used. In many cases of direct 
interpolation it is not necessary to record them explicitly. 
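Inverse interpolation by the Aitken process amounts to interchanging the roles of argument and function value and interpolating at $f = 0$. The sketch below is an illustration under that reading; `aitken` is a hypothetical helper, not a routine from the text. It treats the table just given and also a table of the cubic $(4x + 1)^3 - 343$, whose true zero is 1.5, to show how badly 4-point inverse interpolation can behave even when 4-point direct interpolation is exact.

```python
def aitken(xs, fs, x):
    """Value at x of the polynomial through (xs[i], fs[i]), by Aitken's process."""
    p = list(fs)
    n = len(xs)
    for k in range(1, n):
        for i in range(k, n):
            p[i] = (p[k - 1] * (xs[i] - x) - p[i] * (xs[k - 1] - x)) / (xs[i] - xs[k - 1])
    return p[-1]

x_vals = [0, 1, 2, 3]

# Well-behaved table: treat x as a function of f and interpolate at f = 0.
f_vals = [9480931, 4286597, -905580, -6095598]
root = aitken(f_vals, x_vals, 0)

# Table of (4x + 1)^3 - 343: direct 4-point interpolation is exact,
# but the inverse function is not well represented by a cubic.
g_vals = [(4 * x + 1) ** 3 - 343 for x in x_vals]
bad_root = aitken(g_vals, x_vals, 0)

print(f"first table:  zero estimated as {root:.8f}")
print(f"cubic table:  zero estimated as {bad_root:.4f} (true zero 1.5)")
```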

A word of caution is necessary here. Unless $n = 2$, it is not true that,
if $n$-point interpolation is permissible in a table, then $n$-point inverse
interpolation is permissible. This is best illustrated by an example. Consider
the determination of the zero of $f(x)$ where

$$f(0) = -342, \quad f(1) = -218, \quad f(2) = 386, \quad f(3) = 1854.$$

Actually $f(x) = (4x + 1)^3 - 343$, so that 4-point direct interpolation is
exact, and the zero is 1.5. A 4-point inverse interpolation gives

     f(x)   x
     -342   0                                 -342
     -218   1   2.7581                        -218    124
      386   2    .9396   2.1018                386    728    604
     1854   3    .4672   2.5171   1.9926      1854   2196   2072   1468

For another example, see Hartree [16]. 
Another approach to this problem is the following. We can consider
the reversion of the Lagrangian expansion and obtain

$$p = r + a_2 r^2 + a_3 r^3 + \cdots,$$

where the $a_i$ are rational functions of the tabular values and where $r$
also depends on the given value $\alpha = f(p)$. For simplicity, we give the
formula in the 4-point case (see Salzer [17]):

$$p = r - r^2 s + r^3(2s^2 - t) + r^4(-5s^3 + 5st) + \cdots$$

where

$$r = 6[f(p) - f(0)]/\Delta,$$
$$s = 3[f(1) - 2f(0) + f(-1)]/\Delta,$$
$$t = [f(2) - 3f(1) + 3f(0) - f(-1)]/\Delta,$$
$$\Delta = -f(2) + 6f(1) - 3f(0) - 2f(-1).$$

In the above case we find, with

$$f(-1) = -342, \quad f(0) = -218, \quad f(1) = 386, \quad f(2) = 1854,$$

that

$$r = 1308/1800, \quad s = 1440/1800, \quad t = 384/1800, \quad \Delta = 1800,$$

which gives

$$p = .7267 - .4224 + .4093 - \cdots$$
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The behavior of the terms here suggests that we are in trouble. It 
should be contrasted with the behavior of the same formula in a reason- 
able case. 
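The terms of the 4-point reversion formula for this example can be reproduced as follows. This is a sketch for checking the arithmetic only; the function name `reversion_terms` is hypothetical, and exact rational arithmetic is used so that no rounding enters before printing.

```python
from fractions import Fraction as F

def reversion_terms(fm1, f0, f1, f2, target):
    """First four terms of the 4-point reversion series for p, given
    f(-1), f(0), f(1), f(2) and the target value f(p)."""
    delta = -f2 + 6 * f1 - 3 * f0 - 2 * fm1
    r = F(6 * (target - f0), delta)
    s = F(3 * (f1 - 2 * f0 + fm1), delta)
    t = F(f2 - 3 * f1 + 3 * f0 - fm1, delta)
    return [r, -r**2 * s, r**3 * (2 * s**2 - t), r**4 * (-5 * s**3 + 5 * s * t)]

terms = reversion_terms(-342, -218, 386, 1854, 0)
for k, term in enumerate(terms, start=1):
    print(f"term {k}: {float(term):+.4f}")
```

The successive terms do not decrease in magnitude, which is the warning sign referred to in the text.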

It is possible to obtain rules for the neglect of the contributions of 
differences in inverse interpolation. We shall not discuss these in detail, 
but the reader should refer to the literature cited earlier. 

A related problem is the determination of the position $x_m$ of the minimum
or maximum of a tabulated function, or, more generally, where a
tabulated function has an assigned derivative; this has been discussed
thoroughly by Salzer. We discuss directly a 3-point method. We fit a
parabola $y = a + bx + cx^2$ to $f(x)$ at $x = -1$, $x = 0$, $x = 1$. Then

$$a = f(0), \quad b = \tfrac12[f(1) - f(-1)], \quad c = \tfrac12[f(1) - 2f(0) + f(-1)].$$

We obtain for the abscissa and ordinate of its vertex the expressions

$$x_m = -\frac{b}{2c} = -\frac12\,\frac{f(1) - f(-1)}{f(1) - 2f(0) + f(-1)},$$

$$y_m = \frac{4ac - b^2}{4c} = f(0) - \frac18\,\frac{[f(1) - f(-1)]^2}{f(1) - 2f(0) + f(-1)}.$$
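The 3-point vertex formulas translate directly into code. The sketch below is illustrative; `parabola_vertex` is a hypothetical name, and the argument is measured in units of the tabular interval from the central point, as in the text.

```python
def parabola_vertex(fm1, f0, f1):
    """Vertex (x_m, y_m) of the parabola through (-1, fm1), (0, f0), (1, f1)."""
    second = f1 - 2 * f0 + fm1          # second difference; zero means no vertex
    if second == 0:
        raise ValueError("the three points are collinear")
    xm = -0.5 * (f1 - fm1) / second
    ym = f0 - 0.125 * (f1 - fm1) ** 2 / second
    return xm, ym

# Check on a known parabola, f(x) = (x - 0.3)^2 + 2, whose minimum is at (0.3, 2).
f = lambda x: (x - 0.3) ** 2 + 2
xm, ym = parabola_vertex(f(-1), f(0), f(1))
print(f"x_m = {xm:.6f}, y_m = {ym:.6f}")
```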


2.12 Multivariate Interpolation 


Much of the preceding material on univariate interpolation can be
extended to the multivariate case. The naive approach to the problem
of finding $f(p,q)$ in a double-entry table is to do successive univariate
interpolation; that is, we find $f(0,q)$, $f(1,q)$, ... by interpolation in the $y$
direction and obtain $f(p,q)$ by interpolation in the $x$ direction among
these values. If much interpolation is to be done, more efficient methods
must be sought. In good tables, advice for interpolation is given, and
where interpolation is thought likely to be necessary, the table is planned
to make it as comfortable as possible.

An important special case of multivariate interpolation is in tables 
of regular functions of a complex variable. Here use of the Cauchy- 
Riemann equations can greatly facilitate matters. We refer to the in- 
troductions of the many recent tables of this character for various 
approaches. 

Special methods for functions of particular form are often convenient. 
For instance, in the case of functions defined by integrals, Gaussian 
quadratures may be applied (see, e.g., Todd [18]). 

The use of the throwback for bivariate interpolation has been discussed
by Southard [19].

Multivariate interpolation has been the object of some recent research
by Milne and his collaborators and by Thacher [20].
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QUADRATURE AND DIFFERENTIATION 


2.13 Introduction 


The problem with which we are mainly concerned is the numerical
evaluation of

$$F(x) = \int_a^x f(t)\,dt$$

for one or more values of $x$, where $f(t)$ is given analytically, or by a table.
A comprehensive bibliography has been given by Stroud [68].

Occasionally the simplest solution to this will be provided by obtaining 
the indefinite integral analytically and then referring to tables of the 
functions involved. Among the more elaborate tables of integrals in 
common use are those of Gröbner and Hofreiter [21], Bierens de Haan
[22], Byrd and Friedman [23], and Ryshik and Gradstein [24]. In 
many cases the explicit analytical form of F(x) may not be very helpful 
(an example of this is given in [25], p. 130). However, even if this solu- 
tion is too unwieldy for general use, it may be valuable for checking—for 
example, the final value. 

An account of some of the ways of reducing integrals to more manage- 
able forms is given by Abramowitz [26]. 


2.14 Lagrangian Formulas 


The idea now to be exploited is the following. In order to evaluate

$$I = \int_a^b f(x)\,dx$$

approximately, we shall evaluate the integral of an approximation to
$f(x)$. We make use of the results already available about the approximations
to functions by polynomials.

For instance, if

$$a \le x_0 < x_1 < \cdots < x_n \le b$$

and if

$$L_n(f,x) = \sum_{i=0}^{n} f(x_i)\,l_i(x)$$

is the corresponding Lagrangian polynomial, we can consider

$$Q = \int_a^b L_n(f,x)\,dx = \sum_{i=0}^{n} f(x_i)\int_a^b l_i(x)\,dx = \sum_{i=0}^{n} A_i f(x_i)$$

as an approximation* to $I$. We note that the coefficients $A_i$ do not
depend on the particular function being integrated, only on the nodes
$x_0, x_1, \ldots, x_n$. If we take special cases, such as the one in which the
nodes are equally spaced, the tabulation of the $A_i$ is feasible, and a convenient
solution to the problem is available. Before dealing with the
general case and discussing the errors involved, we take up a few particular
cases.

(a) The case in which $n = 1$, when $f(x)$ is approximated by a linear
function, gives the following:

$$f(x) \approx f(a) + \frac{x - a}{b - a}[f(b) - f(a)],$$

so that

$$Q = \tfrac12(b - a)[f(b) + f(a)].$$

This is the trapezoidal rule: the area under the curve is approximated by
the area of the trapezium.

It can be shown that $I - Q = -\delta^2/12$, where $\delta^2$ is the second difference;
this statement, and similar ones for other quadratures, are elucidated
in a discussion at the end of this section.

It is important to note the power of this simple method in dealing with
the numerical integration of periodic functions over a full period. For a
full discussion we refer to Birkhoff, Young, and Zarantonello [61], Davis
[31, 62], and Hammerlin [63].

(b) Consider next the case in which we have the 4 points $-1$, 0, 1, 2
and approximate the integrand by a cubic $C(x)$. If we take

$$C(x) = f_0 + ax + bx^2 + cx^3,$$

so that $C(0) = f_0$, and require in addition that

$$C(-1) = f_{-1}, \quad C(1) = f_1, \quad C(2) = f_2,$$

we can obtain $a$, $b$, $c$ as linear combinations of the $f_i$. In fact,

$$a = -\tfrac13 f_{-1} - \tfrac12 f_0 + f_1 - \tfrac16 f_2,$$
$$b = \tfrac12 f_{-1} - f_0 + \tfrac12 f_1,$$
$$c = -\tfrac16 f_{-1} + \tfrac12 f_0 - \tfrac12 f_1 + \tfrac16 f_2.$$


*Since $L_n(f,x) = f(x)$ if $f(x)$ is a polynomial of degree at most $n$, it follows that $Q = I$
in this case. This result clearly remains true in the weighted case when we consider
the approximation of

$$I = \int_a^b f(x)p(x)\,dx$$

by $Q = \sum A_i f(x_i)$ where now

$$A_i = \int_a^b l_i(x)p(x)\,dx.$$


We then have

$$Q = \int_{-1}^{2} C(x)\,dx = 3f_0 + \tfrac32 a + 3b + \tfrac{15}{4}c = \tfrac38(f_{-1} + 3f_0 + 3f_1 + f_2).$$

This is called the three-eighths rule. It is of the type known as closed:
the end ordinates are used.

It can be shown that $I - Q = -3\delta^4/80$, where $\delta^4$ is the fourth difference.

(c) Consider next the case when we have 5 points, but approximate
the integrand by a quadratic through the 3 interior points.

If we take

$$q(x) = f_0 + ax + bx^2$$

so that $q(0) = f_0$ and require in addition that

$$q(-1) = f_{-1}, \quad q(1) = f_1,$$

we can obtain $a$, $b$ as linear combinations of the $f_i$. We have

$$Q = \int_{-2}^{2} q(x)\,dx = 4f_0 + \tfrac{16}{3}b = \tfrac43(2f_{-1} - f_0 + 2f_1).$$

It can be shown that $I - Q = \tfrac{14}{45}\delta^4$. This formula has been exploited
by Milne. It is of the open type: the end ordinates are not used, and it
can therefore be used as a predictor in the solution of differential equations.

We note that the error incurred in the use of open-type formulas is 
likely to be larger than that of closed-type formulas. We have pointed 
out already the growth of the error term in a Lagrangian interpolation 
as we move away from the center and particularly as we get outside the 
nodes, that is, when we extrapolate. 

(d) Another common integration formula is Simpson's rule:

$$\int_{-1}^{1} f(x)\,dx \approx \tfrac13(f_{-1} + 4f_0 + f_1).$$

This can be derived by integrating a cubic which coincides with $f(x)$ at
$-1$, 0, 1 and, in addition, has the same derivative at $x = 0$ as $f(x)$.

It can be shown that $I - Q = -\delta^4/90$.

(e) Among the many other quadrature formulas of this type is
Weddle's rule:

$$\int_0^6 f(x)\,dx \approx \tfrac{3}{10}(f_0 + 5f_1 + f_2 + 6f_3 + f_4 + 5f_5 + f_6).$$

It can be shown that $I - Q = -\delta^6/140$.

(f) A collection of these formulas has been made by Bickley [27] (see
also IAT).
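The rules (a) to (e) can be checked against monomials in exact arithmetic. The sketch below is illustrative only; the function names are hypothetical, each rule is written on the interval used in the text, and `Fraction` is used so that "exact for polynomials of such a degree" can be tested by equality.

```python
from fractions import Fraction as F

def trapezoid(f, a, b):                 # rule (a) on [a, b]
    return F(b - a, 2) * (f(a) + f(b))

def three_eighths(f):                   # rule (b) on [-1, 2]
    return F(3, 8) * (f(-1) + 3 * f(0) + 3 * f(1) + f(2))

def milne_open(f):                      # rule (c) on [-2, 2], end ordinates unused
    return F(4, 3) * (2 * f(-1) - f(0) + 2 * f(1))

def simpson(f):                         # rule (d) on [-1, 1]
    return F(1, 3) * (f(-1) + 4 * f(0) + f(1))

def weddle(f):                          # rule (e) on [0, 6]
    coeffs = [1, 5, 1, 6, 1, 5, 1]
    return F(3, 10) * sum(c * f(i) for i, c in enumerate(coeffs))

def exact_monomial(k, a, b):
    """Exact integral of x**k over [a, b]."""
    return F(b ** (k + 1) - a ** (k + 1), k + 1)

print(simpson(lambda x: x**3), exact_monomial(3, -1, 1))      # cubic: exact
print(simpson(lambda x: x**4), exact_monomial(4, -1, 1))      # quartic: not exact
print(weddle(lambda x: x**5), exact_monomial(5, 0, 6))        # quintic: exact
```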
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We return now to more theoretical considerations of quadrature for- 
mulas of the Lagrangian type. 

We have first of all to estimate the error committed. For clarity we
consider an ordinary Lagrangian $(n + 1)$-point scheme for a function
$f(x)$ which has a continuous $(n + 1)$st derivative. We know that

$$f(x) - L_n(f,x) = (x - x_0)(x - x_1)\cdots(x - x_n)\,\frac{f^{(n+1)}(\xi)}{(n + 1)!}.$$

If we consider integrating this with respect to $x$, we must remember that
in general $\xi$ depends on $x$. About the best we can do is to replace
$f^{(n+1)}(\xi)$ by $M_{n+1} = \max_{a \le x \le b} |f^{(n+1)}(x)|$. We then find

$$|I - Q| \le \frac{M_{n+1}}{(n + 1)!}\int_a^b |(x - x_0)(x - x_1)\cdots(x - x_n)|\,dx.$$

A crude estimate of the integral is $(b - a)^{n+2}$, so that

$$|I - Q_n| \le M_{n+1}(b - a)^{n+2}/(n + 1)!.$$

This is sufficient to show that $Q_n \to I$ as $n \to \infty$ whenever $f(z)$ is an
entire function.

We have given, with each of our formulas, an estimate of the error 
committed. Such estimates can be obtained in various ways. Perhaps 
the method due to Peano [28] is the most systematic. 

We discuss the Simpson's-rule case only. We begin with the relation

$$\int u v'''\,dt = u v'' - u' v' + u'' v - \int u''' v\,dt.$$

Applying this when $u = \tfrac16 x(1 - x)^2$, $v = f(x) + f(-x)$, $g = uv'''$, we
find

$$\int_0^1 g\,dx = \tfrac13[f(1) + 4f(0) + f(-1)] - \int_{-1}^{1} f(x)\,dx,$$

so that

$$\int_{-1}^{1} f(t)\,dt = \tfrac13[f(1) + 4f(0) + f(-1)] - \int_0^1 g\,dx.$$

Now if $f^{(4)}(t)$ is continuous, $f'''(x) - f'''(-x) = 2x f^{(4)}(\xi)$, $-x < \xi < x$.
Hence, if $|f^{(4)}(t)| \le M_4$, we have

$$\left|\int_0^1 g\,dx\right| \le \tfrac13 M_4\int_0^1 x^2(1 - x)^2\,dx = \frac{M_4}{90}.$$

If we argue more closely we find, in the case of intervals of length $h$,

$$I - Q = -\frac{h^5}{90} f^{(4)}(\xi).$$

Recalling that the fourth difference, at interval $h$, is approximately
$h^4 f^{(4)}(x)$, this gives

$$I - Q = -h\delta^4/90,$$

where $\delta^4$ is an "average" fourth difference; this is the result stated above.
To clarify the situation, let us take the case of the evaluation of

$$S(.5,.5) = \int_0^{.5} \exp(-.5\sec t)\,dt,$$

using $h = .01$. There will then be 25 intervals, and the error in each
will be estimated as $-h \cdot h^4 f^{(4)}(\xi)/90$, so the total error is in absolute
value at most

$$\frac{25h^5 M_4}{90},$$

which is less than $1.2 \times 10^{-10}$ if $M_4 = \max_{0 \le x \le .5} |f^{(4)}(x)| < 4$, which seems
a reasonable guess. (See Sec. 4.24.)


We note here how we can improve the accuracy of a formula such as
Simpson's by applying it to subintervals of $[a,b]$, so that "$h$" is reduced.
For simplicity, we discuss the case in which we consider a double application
of the rule. We have to compare

$$I_1 = \tfrac23 h(f_0 + 4f_2 + f_4)$$

with

$$I_2 = \tfrac13 h(f_0 + 4f_1 + f_2) + \tfrac13 h(f_2 + 4f_3 + f_4).$$

Assuming the same bound $M_4$ for $f^{(4)}$, the error $|I_1 - Q|$ is approximately
$(2h)^5 M_4/90$, whereas $|I_2 - Q|$ is about $2h^5 M_4/90$, an improvement
by a factor of $2^4 = 16$.
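The factor of 16 is easy to observe numerically. The following sketch is an illustration, with the integrand $e^x$ on $[0, .4]$ chosen arbitrarily (it is not an example from the text); it compares a single application of Simpson's rule with a double application.

```python
import math

def simpson(f, a, b):
    """One application of Simpson's rule on [a, b]."""
    m = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(m) + f(b))

a, b = 0.0, 0.4
exact = math.exp(b) - math.exp(a)

single = simpson(math.exp, a, b)                                   # tabular interval 2h
double = simpson(math.exp, a, 0.2) + simpson(math.exp, 0.2, b)     # tabular interval h

ratio = (single - exact) / (double - exact)
print(f"error (single) = {single - exact:.3e}")
print(f"error (double) = {double - exact:.3e}")
print(f"ratio          = {ratio:.2f}")
```

The ratio is close to, but not exactly, 16 because $f^{(4)}$ is not constant over the interval.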


2.15 Quadratures Using Differences 


(a) One of the most efficient quadrature formulas is due to Gauss and
uses central differences. The basic relation is

$$h^{-1}\int_{x_0}^{x_1} f(x)\,dx = \frac{\delta}{hD} f_{1/2} = \frac{\delta/2}{\sinh^{-1}(\delta/2)}\Bigl(1 + \tfrac14\delta^2\Bigr)^{-1/2}\mu f_{1/2}$$

$$= (1 - \tfrac18\delta^2 + \tfrac{3}{128}\delta^4 - \cdots)(1 + \tfrac{1}{24}\delta^2 - \tfrac{17}{5760}\delta^4 + \cdots)\,\mu f_{1/2}$$

$$= (1 - \tfrac{1}{12}\delta^2 + \tfrac{11}{720}\delta^4 - \cdots)\,\mu f_{1/2}.$$

How this is used in practice is indicated by the following beginning of an
evaluation of $I = \int_0^x (1 - x^2)^{1/2}\,dx$, using an interval $h = .05$. The integrand
is tabulated and differenced. The starred entries are not officially
available if we restrict ourselves to the given values of the integrand.
However, these may be estimated in various ways; for instance,
the entries $-1$, $-3$ in the last column are within the range $\pm 2^5\epsilon$ of
errors due to rounding of the initial values, and it is reasonable to
assume a zero fifth difference and a fourth difference constant at $-22$, from
which the third difference, $-6$, and the second difference, $-2505$, can
be obtained.† This being done, the first column to the left of the argument
column is computed; it is

$$h(f - \tfrac{1}{12}\delta^2 f + \tfrac{11}{720}\delta^4 f - \cdots);$$

from this column, that labeled $\delta I$ is obtained by averaging, and from
the $\delta I$ column, that for $I$ is obtained by addition.

It is clear that the errors in our estimates of the missing differences are
obliterated by the multiplying factors $-h/12$, $+11h/720$, ....

The correct value, obtained analytically, is

$$I(.2) = \tfrac12[.2(.96)^{1/2} + \arcsin .2] = .198658.$$

      I(x)     δI              x    (1 - x²)^½              δ²             δ⁴
         0            50010    0     1.000000            -2502*          -18*
              49979                              -1251             -9*           -1*
     49979            49948   .05     .998749            -2511           -19*
              49854                              -3762            -28            -3*
     99833            49760   .10     .994987            -2539           -22
              49602                              -6301            -50
    149435            49445   .15     .988686            -2589
              49223                              -8890
    198658            49001   .20     .979796

(b) The Gregory method uses only differences which are actually
available. The formula is

$$\int_0^n f(p)\,dp = \tfrac12 f_0 + f_1 + \cdots + f_{n-1} + \tfrac12 f_n + \tfrac{1}{12}(\Delta_0 - \nabla_n) - \tfrac{1}{24}(\Delta_0^2 + \nabla_n^2) + \tfrac{19}{720}(\Delta_0^3 - \nabla_n^3) - \tfrac{3}{160}(\Delta_0^4 + \nabla_n^4) + \cdots.$$

† Thus the starred entries in the table are estimated as: second difference $-2505$;
third difference $-6$; fourth differences $-22$, $-22$; fifth differences 0, 0.

Notice that the differences involved are the first and the last which
are obtained in differencing the integrand at values within the range
of integration. The basis of a formal derivation follows. Since
$D = \log(1 + \Delta)$, we have

$$\frac{1}{D} = \frac{1}{\Delta - (\Delta^2/2) + (\Delta^3/3) - \cdots} = \frac{1}{\Delta}\Bigl(1 - \frac{\Delta}{2} + \frac{\Delta^2}{3} - \cdots\Bigr)^{-1} = \frac{1}{\Delta} + \frac12 - \frac{\Delta}{12} + \frac{\Delta^2}{24} - \cdots.$$

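(c) The Euler-Maclaurin formula is often convenient, either for the
evaluation of a sum or for the evaluation of an integral. The formula is

$$h^{-1}\int_0^{nh} f(x)\,dx = \int_0^n f(ph)\,dp = \tfrac12 f_0 + f_1 + f_2 + \cdots + f_{n-1} + \tfrac12 f_n - \tfrac{1}{12}h(f_n' - f_0') + \tfrac{1}{720}h^3(f_n''' - f_0''') - \tfrac{1}{30240}h^5(f_n^{(5)} - f_0^{(5)}) + \cdots.$$

It can be derived by the use of the finite-difference operators as follows.
We note that

$$\frac{1}{E - 1} = \frac{1}{e^D - 1} = \frac{1}{D} - \frac12 + \frac{B_2}{2!}D + \frac{B_4}{4!}D^3 + \frac{B_6}{6!}D^5 + \cdots = \frac{1}{D} - \frac12 + \frac{1}{12}D - \frac{1}{720}D^3 + \frac{1}{30240}D^5 - \cdots,$$

where the $B$'s are the Bernoulli numbers defined by the generating
function

$$\frac{t}{e^t - 1} = 1 - \frac{t}{2} + \frac{B_2}{2!}t^2 + \frac{B_4}{4!}t^4 + \frac{B_6}{6!}t^6 + \cdots.$$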
A careful discussion of this formula and a derivation of an error estimate
can be found in Knopp [29]. For the present we discuss an example,
the evaluation of $\sum_1^\infty n^{-2}$. If we let $n \to \infty$ in the above formula
and if all the derivatives $f^{(2r+1)}(n) \to 0$ as $n \to \infty$, we obtain, formally,

$$\sum_{n=1}^{\infty} f(n) = \int_0^\infty f(x)\,dx - \tfrac12 f(0) - \tfrac{1}{12}f'(0) + \tfrac{1}{720}f'''(0) - \cdots.$$

If we take $f(n) = (10 + n)^{-2}$, we find

$$\sum_{n=1}^{\infty} n^{-2} = (1 + 2^{-2} + \cdots + 10^{-2}) + \sum_{n=1}^{\infty} f(n)$$

$$= 1.54976\,77312 + \int_0^\infty (10 + x)^{-2}\,dx - \tfrac12(10)^{-2} + \tfrac{1}{12}(2!)(10)^{-3} - \tfrac{1}{720}(4!)(10)^{-5} + \cdots$$

$$= 1.54976\,77312 + .10000\,00000 - .00500\,00000 + .00016\,66667 - .00000\,03333 + .00000\,00024 - \cdots$$

$$= 1.64493\,40670.$$

This is to be compared with

$$\pi^2/6 = 1.64493\,40668.$$

A naive approach to the evaluation of $\sum_1^\infty n^{-2}$ shows that the remainder
after $n$ terms is about $n^{-1}$, so that a direct summation is not feasible.
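The arithmetic of this example is reproduced by the short sketch below. It is illustrative only: the variable names are hypothetical, and the tail corrections are the Euler-Maclaurin terms written out for $f(x) = (N + x)^{-2}$ with $N = 10$, using $f'(0) = -2N^{-3}$, $f'''(0) = -24N^{-5}$, $f^{(5)}(0) = -720N^{-7}$.

```python
import math

N = 10

# Direct sum of the first N terms: 1 + 2^-2 + ... + N^-2.
head = sum(1.0 / n**2 for n in range(1, N + 1))

# Euler-Maclaurin estimate of the tail sum_{n>=1} (N + n)^-2:
#   integral - f(0)/2 - f'(0)/12 + f'''(0)/720 - f^(5)(0)/30240
tail = (1.0 / N                      # integral of (N + x)^-2 over [0, inf)
        - 1.0 / (2 * N**2)           # -f(0)/2
        + 1.0 / (6 * N**3)           # -f'(0)/12
        - 1.0 / (30 * N**5)          # +f'''(0)/720
        + 1.0 / (42 * N**7))         # -f^(5)(0)/30240

total = head + tail
print(f"head  = {head:.10f}")
print(f"total = {total:.10f}")
print(f"pi^2/6 = {math.pi**2 / 6:.10f}")
```

Ten directly summed terms plus five correction terms give about nine correct decimals, whereas direct summation alone would need on the order of $10^9$ terms.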


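In a similar way, it is possible to evaluate $\sum_0^\infty (2n + 1)^{-2}$ by applying
the Euler-Maclaurin formula to the tail $(21)^{-2} + (23)^{-2} + (25)^{-2} + \cdots$
of the series. We have

$$\int_0^\infty \frac{dx}{(21 + 2x)^2} = \frac{1}{2 \cdot 21}$$

and so

$$\sum_{n=0}^{\infty} (2n + 1)^{-2} = (1 + 3^{-2} + \cdots + 19^{-2}) + \frac{1}{2 \cdot 21} + \Bigl\{\frac12 \cdot \frac{1}{21^2} + \frac{1}{12} \cdot \frac{4}{21^3} - \frac{1}{720} \cdot \frac{24 \cdot 8}{21^5} + \frac{1}{30240} \cdot \frac{720 \cdot 32}{21^7} - \cdots\Bigr\}$$

$$= 1.20872\,1307 + .02380\,9524 + .00116\,9715$$

$$= 1.23370\,0548,$$

which is to be compared with

$$\pi^2/8 = 1.23370\,0550.$$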


2.16 Gaussian-type Quadratures 


We have noted that a Lagrangian $(n + 1)$-point quadrature, with
arbitrary nodes, is exact for polynomials of degree at most $n$. Can we
do better than this if we choose the $x_i$ cleverly?

If we consider the equality

$$\int_a^b f(x)\,dx = \sum A_i f(x_i)$$

or, more generally,

$$\int_a^b f(x)p(x)\,dx = \sum A_i f(x_i), \qquad (2.15)$$

where $p(x)$ is a fixed, positive weight function, we see that there are
$(2n + 2)$ constants on the right. One might expect to evaluate these by
requiring that (2.15) be satisfied for $f(x) = x^r$, $r = 0, 1, \ldots, 2n + 1$.
In this case, since the operations are linear, there would be equality in
(2.15) when $f$ is a polynomial of degree $2n + 1$ at most. We shall show
that this is indeed possible, relying on the general theory of orthogonal
polynomials.

Let $p_0, p_1, \ldots, p_n, \ldots$ be the normal orthogonal system constructed from
$1, x, \ldots, x^n, \ldots$. Let $x_i = x_i^{(n)}$, $i = 1, 2, \ldots, n$, be the $n$ real zeros of
$p_n(x)$. Let $f(x)$ be any polynomial of degree $2n + 1$ at most. Then we
can write

$$f(x) = q(x)p_{n+1}(x) + r(x) \qquad (2.15')$$

where the quotient $q(x)$ and the remainder $r(x)$ are each of degree $n$ at
most. We then have

$$\int_a^b f(x)p(x)\,dx = \int_a^b q(x)p_{n+1}(x)p(x)\,dx + \int_a^b r(x)p(x)\,dx.$$

By orthogonality, the first integral on the right vanishes. Hence

$$\int_a^b f(x)p(x)\,dx = \int_a^b r(x)p(x)\,dx$$

and, if we use an $(n + 1)$-point Lagrangian quadrature, the integral on
the right is exactly

$$\sum A_i r(x_i).$$

But, from (2.15'), since $p_{n+1}(x_i) = 0$ for all $i$,

$$f(x_i) = r(x_i).$$

Hence

$$\int_a^b f(x)p(x)\,dx = \sum A_i f(x_i);$$

that is, the quadrature based on the nodes $\{x_i^{(n+1)}\}$ is exact for $f$.

The coefficients $A_i$ are often called the Christoffel numbers. It can
be shown that they are always positive.
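The degree of exactness can be observed in the simplest classical case. The sketch below is an illustration, not part of the text: it takes the weight $p(x) = 1$ on $[-1, 1]$ and the well-known 3-point Gauss-Legendre nodes $0, \pm\sqrt{3/5}$ with Christoffel numbers $8/9, 5/9$, so that $n + 1 = 3$ and $2n + 1 = 5$.

```python
import math

# 3-point Gauss-Legendre rule on [-1, 1] (weight function identically 1).
nodes = [-math.sqrt(0.6), 0.0, math.sqrt(0.6)]
weights = [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]

def gauss3(f):
    return sum(w * f(x) for w, x in zip(weights, nodes))

def exact_monomial(r):
    """Exact integral of x**r over [-1, 1]."""
    return 0.0 if r % 2 else 2.0 / (r + 1)

for r in range(8):
    q = gauss3(lambda x, r=r: x**r)
    print(f"x^{r}: quadrature {q:+.12f}, exact {exact_monomial(r):+.12f}")
```

The rule reproduces every monomial through $x^5$ and first fails at $x^6$, in agreement with the count of $2n + 2$ free constants.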
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The fact, just established, that an (n + 1)-point Gaussian quadrature 
is exact for polynomials of degree at most $2n + 1$ suggests that, in the
general case, the error might be a multiple of $f^{(2n+2)}(\xi)$. This can be
established by integrating the Hermite interpolation formula

$$f(x) = H_{2n+1}(x) + \frac{f^{(2n+2)}(\xi)}{(2n + 2)!}\prod_i (x - x_i)^2, \qquad \xi = \xi(x),$$

to get

$$\int_a^b f(x)p(x)\,dx - \sum A_i f(x_i) = \frac{f^{(2n+2)}(\xi)}{(2n + 2)!}\int_a^b \prod_i (x - x_i)^2\,p(x)\,dx, \qquad a < \xi < b.$$

The integral on the right does not depend on $f$ and can be evaluated
once for all; the values in the classical cases are given in Chap. 3.
The values of the abscissas and the Christoffel numbers or multipliers
in various cases have been tabulated adequately (see, e.g., references
given in Chap. 3).

We discuss one example, the calculation of $E_1(z) = \int_z^\infty (e^{-u}/u)\,du$, for
two cases: (a) for $z = 10$, a real value of $z$, and (b) for $z = -10 + 5i$. We
change the variable and obtain

$$e^z E_1(z) = I_1 - iI_2,$$

where, with $z = x + iy$,

$$I_1 = \int_0^\infty e^{-t}\,\frac{x + t}{(x + t)^2 + y^2}\,dt, \qquad I_2 = \int_0^\infty e^{-t}\,\frac{y}{(x + t)^2 + y^2}\,dt.$$

The Laguerre quadrature

$$I = \int_0^\infty e^{-t} f(t)\,dt \approx \sum_i A_i^{(n)} f(x_i^{(n)}) = Q$$

is applicable. We shall use the values of $A_i^{(n)}$, $x_i^{(n)}$ obtained by Salzer
and Zucker and shall use $n = 5$:

         x_i^(5)                  A_i^(5)

      .26356 03197 18       .52175 56105 83
     1.41340 30591 07       .39866 68110 83
     3.59642 57710 41       .07594 24496 817
     7.08581 00058 59       .00361 17586 7992
    12.64080 08442 76       .00002 33699 723858

(a) In this case, $I_2 = 0$, and we have

$$I_1 \approx Q_1 = \sum_i \frac{A_i}{10 + x_i}.$$

All that is now required is to make a table of reciprocals of $10 + x_i$ and
accumulate the product $A_i(10 + x_i)^{-1}$. We find

$$Q_1 = .09156\,33319,$$

which is to be compared with

$$I_1 = e^{10}E_1(10) = .09156\,3334,$$

with an error of about $2 \times 10^{-9}$.
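Case (a) is a five-term sum and is easily reproduced. The sketch below is illustrative; the list names are hypothetical, and the abscissas and multipliers are the 5-point Laguerre values tabulated above.

```python
# 5-point Laguerre abscissas x_i and Christoffel numbers A_i, as tabulated above.
x = [0.263560319718, 1.413403059107, 3.596425771041,
     7.085810005859, 12.640800844276]
A = [0.521755610583, 0.398666811083, 0.0759424496817,
     0.00361175867992, 0.0000233699723858]

# Case (a): z = 10, so f(t) = 1/(10 + t) and Q1 = sum A_i / (10 + x_i).
Q1 = sum(a / (10.0 + xi) for a, xi in zip(A, x))
print(f"Q1 = {Q1:.10f}")
print(f"difference from the quoted e^10 E_1(10) = .091563334: {Q1 - 0.091563334:.1e}")
```

The multipliers sum to 1, as they must since the rule integrates $e^{-t}$ exactly; this gives a convenient check on the transcription of the table.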
(b) In this case, both $I_1$ and $I_2$ are different from zero. We have

$$I_1 \approx Q_1 = \sum_i A_i\,\frac{-10 + x_i}{25 + (x_i - 10)^2}, \qquad I_2 \approx Q_2 = \sum_i \frac{5A_i}{25 + (x_i - 10)^2},$$

and we begin by making a table of

$$-10 + x_i, \quad 25 + (x_i - 10)^2, \quad A_i' = A_i/[25 + (x_i - 10)^2]$$

for $i = 1(1)5$. We then obtain $Q_1$ and $Q_2$ by accumulating the
products $\sum A_i'(x_i - 10)$ and $\sum 5A_i'$. We find

$$Q_1 = -.08475\,7264,$$
$$Q_2 = .04826\,1807,$$

which are to be compared with the correct values:

$$\mathrm{Re}[\exp(-10 + 5i)\,E_1(-10 + 5i)] = -.08475\,749,$$
$$\mathrm{Im}[\exp(-10 + 5i)\,E_1(-10 + 5i)] = -.04826\,039.$$

The error is of the order $2 \times 10^{-6}$.
It is easy to estimate the error. Indeed, the $2n$th derivatives of
$(x + t)/[(x + t)^2 + y^2]$ and $y/[(x + t)^2 + y^2]$ are

$$(2n)!\,r_i^{-2n-1}\cos(2n + 1)\theta_i \quad\text{and}\quad (2n)!\,r_i^{-2n-1}\sin(2n + 1)\theta_i,$$

where $r_i = [(x + \xi_i)^2 + y^2]^{1/2}$, $0 < \xi_i < \infty$; and $\theta_i$ is defined by
$\cos\theta_i = (x + \xi_i)/r_i$, $\sin\theta_i = y/r_i$. Hence the errors do not exceed

$$\frac{(n!)^2}{|z|^{2n+1}}, \quad x > 0; \qquad\text{or}\qquad \frac{(n!)^2}{|y|^{2n+1}}, \quad x < 0.$$

We observe that the bound increases as $z$ approaches the negative real
axis and that there is an optimum value of $n$, about $|z|$ in the case $x > 0$
and about $|y|$ when $x < 0$.

For further details see Todd [18].


2.17 Use of Functional Analysis


In all the methods discussed so far, the error estimates depend on a
value of a high derivative of the integrand at an intermediate point. In
many cases it is not easy to find reasonable bounds. An alternative
approach, by the methods of functional analysis, has been developing
recently: the error is expressed in terms of a norm of the integrand. For
an example, we refer to Chap. 1, Sec. 1.6(a). The method is elaborated
in Chap. 14 and in various papers; for example, those by Davis [30, 31]
and Langer [32].


2.18 Comparison of Methods 


Many of the arguments given in Sec. 2.8 regarding the relative merits of
Lagrangian and finite-difference methods for interpolation apply again
in the present context. Here we have also to evaluate the Gaussian-type
methods. It is clear that the latter are not likely to be very practical
when the integrand is tabulated at equal intervals, for preliminary
interpolations will be necessary to evaluate the $f(x_i)$, and these may
outbalance any gain due to the smaller error estimates. On the other hand,
if the integrand is not tabulated, the Gaussian type may be very convenient;
this is certainly so in two cases: (1) when automatic computers
are being used and the evaluation of $f(x_i)$ does not depend on the number
of decimals in the argument and (2) when $f(x_i)$ is being evaluated
experimentally and the $x_i$ are set once for all.


2.19 Multiple Integrals 


Many of the points raised in connection with multivariate interpola- 
tion are again relevant here. We can evaluate multiple integrals as 
repeated integrals, and this is probably the most satisfactory method for 
occasional use. However, when much multiple integration has to be 
done, more powerful methods should be considered. In addition to 
Irwin’s Tract, there is a considerable body of recent literature—for 
example, the papers by Hammer and his collaborators [33, 59] and 
by Thacher [34]. 

There has been some investigation, particularly by Hsu, of approxi- 
mating to a multiple integral by a line integral, where the line recon- 
noiters the region of integration. 


2.20 The Monte Carlo Method 


This method has been used as a tool in the evaluation of multiple
integrals. It has been used, essentially, as a last resort, and, at most,
qualitative error estimates were given. A computational experiment for
a nine-dimensional integral was carried out by Davis and Rabinowitz
[35] (see Chap. 4).
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2.21 Bad Examples 
We mention here some quadratures which present considerable difficulties.

(a) Tabulate $I(x) = \int_0^x (1 - t^2)^{-1/2}\,dt$ for $x = 0(.02)1$, to 6D without
using its indefinite integral. This is to be compared with the discussion
of auxiliary functions in Sec. 2.9. The removal of the singularity by the
device

$$I(x) = \int_0^x \{(1 - t^2)^{-1/2} - [2(1 - t)]^{-1/2}\}\,dt + \int_0^x [2(1 - t)]^{-1/2}\,dt$$

and the direct integration of the second term leave only the first term on
the right to be dealt with; in this the integrand is smooth at $x = 1$.

(b) Tabulate the Airy integral $\mathrm{Ai}(x) = \pi^{-1}\int_0^\infty \cos(\tfrac13 t^3 + xt)\,dt$.
This quadrature is not feasible, and an appropriate method is to verify
that $\mathrm{Ai}(x)$ satisfies the differential equation $y'' = xy$ and to proceed to
solve this equation numerically (see BAAS 12).

(c) Hartree [36] has discussed in detail the evaluation of integrals of 
the form 

aed 6 — ysin 6) cos 6 dO 
. = 3B a (x cos 6 — ysin 6) cos 

for a given value of # (about .7) and for x, y varying up to 60 and 40, 
respectively. 


2.22 Integral Equations 


We indicate here methods of solving integral equations numerically. 
For a systematic account, see Fox and Goodwin [37]. See also Prasad 
[38]. 

(a) Consider the numerical solution of an integral equation of the
Fredholm type

$$f(x) = g(x) + \int_a^b f(t)k(x,t)\,dt$$

for $f(x)$, where $g(x)$ and the kernel $k(x,t)$ are given. We decide that it
will be sufficient if we know the ordinates $f(x_i)$, where $a \le x_1 < x_2 <
\cdots < x_n \le b$. Replacing $x$ by $x_i$ in the above equation and replacing

$$\int_a^b f(t)k(x_i,t)\,dt$$

by a suitable quadrature

$$\sum_j A_j k(x_i,x_j) f(x_j),$$

we obtain a set of $n$ linear equations

$$f(x_i) = g(x_i) + \sum_j A_j k(x_i,x_j) f(x_j), \qquad i = 1, 2, \ldots, n.$$
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The solution of this system will give the f(x;) required. It is usual 
to take the x, as the abscissas of a suitable Gaussian quadrature; the 
determination of f(x) elsewhere is a problem of interpolation. 

For applications in the theory of conformal mapping using automatic 
computers see Todd and Warschawski [64], Stiefel [65], and Todd [66]. 
For a discussion of a nonlinear integral equation, in the same area, using 
desk machines, see Ostrowski [39, 40]. 
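The reduction of a Fredholm equation to a linear system can be sketched on a toy problem. Everything specific here is an assumption chosen for illustration and does not come from the text: the kernel $k(x,t) = xt$ and $g(x) = x$ on $[0, 1]$ (for which the exact solution is $f(x) = \tfrac32 x$), the 2-point Gauss-Legendre nodes, and the helper names `solve` and `nystrom`.

```python
import math

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= factor * m[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x

def nystrom(g, k, nodes, weights):
    """Solve f(x_i) = g(x_i) + sum_j A_j k(x_i, x_j) f(x_j) for the f(x_i)."""
    n = len(nodes)
    mat = [[(1.0 if i == j else 0.0) - weights[j] * k(nodes[i], nodes[j])
            for j in range(n)] for i in range(n)]
    return solve(mat, [g(x) for x in nodes])

# 2-point Gauss-Legendre rule transplanted to [0, 1].
half = 0.5 / math.sqrt(3.0)
nodes = [0.5 - half, 0.5 + half]
weights = [0.5, 0.5]

f_nodes = nystrom(lambda x: x, lambda x, t: x * t, nodes, weights)
for xi, fi in zip(nodes, f_nodes):
    print(f"x = {xi:.6f}: computed f = {fi:.6f}, exact 1.5 x = {1.5 * xi:.6f}")
```

For this degenerate kernel the quadrature is exact, so the computed ordinates agree with the true solution to rounding; in general the accuracy is that of the quadrature applied to $f(t)k(x_i,t)$.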

(b) Consider now an equation of the form

$$f(x) = g(x) + \int_0^x f(t)k(x,t)\,dt$$

for $f(x)$ where $g(x)$ and $k(x,t)$ are given. We propose to tabulate $f(x)$ at
interval $h$. If $f(x)$ is known for $x = 0, h, \ldots, (n - 1)h$ and we approximate
the integral by the Gregory formula, we get a linear equation for
$f(nh)$. To apply this idea, it is necessary to determine the initial values
of $f(x)$, and the number of these required is twice the order of the appropriate
Gregory formula. In many cases it may be possible to obtain a
power series expansion for $f(x)$ about the origin which will indicate the
appropriate formula and give the required values. From then on, the
process is easy, and various conveniences will suggest themselves, especially
in the case when $k(x,t)$ depends only on $x - t$.

As an example, consider the solution of

$$y(x) = 2x + 3 - \int_0^x y(t)[2(x - t) + 3]\,dt, \qquad y(0) = 3.$$

It is readily verified that the solution to this is

$$y(x) = 4e^{-2x} - e^{-x},$$

and this can be used to check the numerical solution to be given. If we
differentiate the equation, we obtain $y'(0) = -7$ and, for $n \ge 1$,

$$y^{(n+1)}(0) = -3y^{(n)}(0) - 2y^{(n-1)}(0),$$

so that

$$y(x) = 3 - 7x + \tfrac{15}{2}x^2 - \tfrac{31}{6}x^3 + \tfrac{21}{8}x^4 - \cdots.$$
If we consider working to 5D, at an interval of .1, we find that the use 
of the Gregory formula, omitting sixth and higher differences, is appro- 


priate. This means that we must compute 9(x) from the series for 
x = 0(.1)1, and the first value to be obtained will be »(1.1). The values 
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are given in the table below. We rewrite the relevant Gregory formula 
in Lagrangian form: 


$$\begin{aligned}
\int_0^n f(p)\,dp ={}& \tfrac12 f(0) + f(1) + \cdots + f(n-1) + \tfrac12 f(n) \\
&+ \tfrac{1}{12}\{[f(1) - f(0)] - [f(n) - f(n-1)]\} \\
&- \tfrac{1}{24}\{[f(2) - 2f(1) + f(0)] + [f(n) - 2f(n-1) + f(n-2)]\} \\
&+ \tfrac{19}{720}\{[f(3) - 3f(2) + 3f(1) - f(0)] - [f(n) - 3f(n-1) + 3f(n-2) - f(n-3)]\} \\
&- \tfrac{3}{160}\{[f(4) - 4f(3) + 6f(2) - 4f(1) + f(0)] \\
&\qquad + [f(n) - 4f(n-1) + 6f(n-2) - 4f(n-3) + f(n-4)]\} \\
&+ \tfrac{863}{60480}\{[f(5) - 5f(4) + 10f(3) - 10f(2) + 5f(1) - f(0)] \\
&\qquad - [f(n) - 5f(n-1) + 10f(n-2) - 10f(n-3) + 5f(n-4) - f(n-5)]\} \\
={}& \sum_{r=0}^{n} a_r f(r),
\end{aligned}$$

where a_0 = .31559, a_1 = 1.39218, a_2 = .62398,
      a_3 = 1.24408, a_4 = .90990, a_5 = 1.01427,
      a_{n−r} = a_r, r = 0, 1, ..., 5;
      a_r = 1 otherwise.


This expression is valid only in the form written for n ≥ 11. Applying
this to our equation, we find, writing

$$2x + 3 = g(x), \qquad 2(x - t) + 3 = k(x - t)$$

and using an interval h,

$$y_n = g_n - h(a_0y_0k_n + a_1y_1k_{n-1} + \cdots + a_0y_nk_0),$$

which gives

$$y_n = (1 + ha_0k_0)^{-1}\,[\,g_n - h(a_0y_0k_n + a_1y_1k_{n-1} + \cdots + a_2k_2y_{n-2} + a_1k_1y_{n-1})\,].$$

In our case, a_0 = .31559, k_0 = 3, h = .1, so that ha_0k_0 = .094677. A
little thought about the progress of the calculation suggests the tabular
arrangement below.


  x       y         ay          ak        k      x

                              5.4        5.4    1.2
  0    3.00000    .94677      5.2        5.2    1.1
 .1    2.37009   3.29959      5.0        5.0    1.0
 .2    1.86255   1.16219      4.8        4.8     .9
 .3    1.45443   1.80943      4.6        4.6     .8
 .4    1.12700   1.02546      4.4        4.4     .7
 .5     .86499    .87733      4.2        4.2     .6
 .6     .65596    .65596      4.05708    4.0     .5
 .7     .48980    .48980      3.45762    3.8     .4
 .8     .35826    .35826      4.47869    3.6     .3
 .9     .25463    .25463      2.12153    3.4     .2
1.0     .17346    .17346      4.45498    3.2     .1
1.1     .11034                 .94677    3.0     0
1.2


We begin our calculation by accumulating the 11 products of adjacent
terms in the central columns of our table to get the term in parentheses
in the expression

$$y_{11} = \frac{5.2 - .1(.94677 \times 5.2 + 3.29959 \times 5.0 + \cdots + .17346 \times 4.45498)}{1.094677} = \frac{5.2 - 5.07922}{1.094677} = .11034.$$

We enter this as indicated, and the next step is to displace the right-hand
half so that 5.4 is opposite .94677 and 4.45498 opposite .11034. We then
obtain

$$y_{12} = \frac{5.4 - .1(.94677 \times 5.4 + \cdots + .17346 \times 2.12153 + .11034 \times 4.45498)}{1.094677} = .06167.$$

And so on. This should be continued until, say, x = 3 and the result
compared with the solution

$$y_{30} = -.03987.$$

Observe that the values of y should be differenced as they are obtained
to ensure that the neglect of the sixth and higher differences of the
integrand is legitimate.
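
The tabular process above can be imitated on an automatic computer. The following Python sketch is only an illustration under stated assumptions: the names `end_weights` and `solve` are ours, the starting values y(0), ..., y(1.0) are taken from the known solution rather than from the power series, and double-precision arithmetic replaces the 5D working of the text.

```python
from math import comb, exp

# Coefficients of the 1st..5th difference corrections in the Gregory formula.
GREGORY = [1 / 12, 1 / 24, 19 / 720, 3 / 160, 863 / 60480]

def end_weights():
    """Lagrangian weights a_0..a_5 that apply at either end of the range."""
    a = []
    for r in range(6):
        corr = sum(c * comb(j, r) for j, c in enumerate(GREGORY, start=1))
        a.append((0.5 if r == 0 else 1.0) - (-1) ** r * corr)
    return a

def solve(g, k, y_start, h, n_last):
    """Step y_n = g_n - h * sum_r a_r y_r k((n - r)h) forward.

    y_start must hold at least 11 values, since the formula needs n >= 11.
    """
    a = end_weights()
    y = list(y_start)
    for n in range(len(y), n_last + 1):
        def weight(r):
            m = min(r, n - r)            # distance from the nearer end
            return a[m] if m <= 5 else 1.0
        s = sum(weight(r) * y[r] * k((n - r) * h) for r in range(n))
        y.append((g(n * h) - h * s) / (1.0 + h * a[0] * k(0.0)))
    return y

def exact(x):
    return 4 * exp(-2 * x) - exp(-x)

h = 0.1
y = solve(lambda x: 2 * x + 3, lambda d: 2 * d + 3,
          [exact(i * h) for i in range(11)], h, 30)
print(round(y[11], 5), round(y[30], 5))
```

The weight of an ordinate depends only on its distance from the nearer end of the range, which is why the text can slide the right-hand half of the table down one line at each step.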


2.23 Numerical Differentiation 


Various expressions for the derivatives of a tabulated function can be
obtained by manipulation of the finite-difference operators. Thus,
from E = exp hD, we find

$$hD = \log(1 + \Delta) = \Delta - \tfrac12\Delta^2 + \tfrac13\Delta^3 - \cdots, \tag{2.16}$$

and from δ = 2 sinh(½hD) we obtain

$$h^2D^2 = \delta^2 - \tfrac{1}{12}\delta^4 + \tfrac{1}{90}\delta^6 - \cdots.$$

These, of course, can be rearranged in Lagrangian form.
Alternatively, take the Newton-Gregory interpolation formula

$$f(p) = f(0) + p\,\Delta f(0) + \tfrac12(p^2 - p)\Delta^2 f(0) + \tfrac16(p^3 - 3p^2 + 2p)\Delta^3 f(0) + \cdots, \tag{2.17}$$

differentiate it with respect to p, and then put p = 0:

$$hf'(0) = \Delta f(0) - \tfrac12\Delta^2 f(0) + \tfrac13\Delta^3 f(0) - \cdots,$$

which is (2.16). If we include the remainder term in (2.17), which is of
the form

$$\frac{p(p-1)\cdots(p-n+1)}{n!}\,h^n f^{(n)}(\xi),$$

and attempt to differentiate it, we come up against the generally
unknown behavior of ξ as a function of p. It is, however, possible to obtain
error estimates in the various differentiation formulas.

Apart from these theoretical difficulties, it is intuitively obvious that
numerical differentiation is delicate; the round-off errors in tabular
values will be magnified on division by the powers of h on the left in the
formula. This suggests that, the larger the interval used, the better.
Actually, of course, there is an optimal size. This is discussed in detail
by Kopal ([41], p. 104). We outline the arguments.

We regard the total error as built up from that due to round-off and
that due to truncation. The first can be estimated in terms of the
precision of the tabulation, and the latter depends on the formula used; but
each depends on the interval h of tabulation. It is clear that the
round-off error will depend on a negative power of h, and the truncation error on
a positive power; in general, therefore, there will be an optimal choice of h.

To fill in the details of such an argument, we have to specify the
precision of our tabulation and the formulas and the error norm used.
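
The existence of an optimal interval is easily observed experimentally. In the following Python sketch (an illustration only, with assumptions of our own choosing) the "tabulation" is the double-precision exponential function, the formula is the one-term approximation hf′(0) ≈ Δf(0), and the error is measured at the single point x = 1.

```python
from math import exp

def forward_difference(f, x, h):
    """One-term approximation f'(x) ~ [f(x + h) - f(x)] / h."""
    return (f(x + h) - f(x)) / h

# Error for f = exp at x = 1 as the interval h = 10^(-k) shrinks.
errors = {k: abs(forward_difference(exp, 1.0, 10.0 ** -k) - exp(1.0))
          for k in range(1, 15)}
best_k = min(errors, key=errors.get)

for k, e in errors.items():
    print(f"h = 1e-{k:02d}   error = {e:.3e}")
print("smallest error at h = 1e-%02d" % best_k)
```

For large h the truncation error dominates and the error falls with h; for very small h the round-off error, divided by h, takes over and the error rises again.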


DIFFERENTIAL EQUATIONS 


2.24 Introduction 


In this part of the chapter we discuss briefly the integration of ordinary
differential equations of a simple type and the solution of simple
difference equations.
We have seen in the two earlier sections of this chapter how the use of
auxiliary functions and the removal of singularities facilitate
interpolation and quadrature. The same is true in the present context. This
was apparent to Jacobi, who wrote, “The principal difficulty in the
integration of given differential equations appears to be the introduction
of appropriate variables. One must use an inverse process; after finding
a noteworthy change of variables, one must try to find problems to which
this will apply.”

Occasionally the simplest solution will be provided by evaluating the
analytical solution, either directly or by the use of tables. The
compendium of Kamke [42] is invaluable, followed by reference to the
indices of Fletcher, Miller, and Rosenhead [43] and of Lebedev and
Fedorova [44]. But it is easy to construct examples (see [25], p. 80 and
(b) in Sec. 2.27) where this method is not the most convenient.


2.25 The Picard Method 


Some of the existence theorems in theoretical analysis are constructive
in character and can be used to provide numerical solutions. For
instance, the Picard method for the initial-value problem

$$y' = f(x,y), \qquad y(a) = b$$

suggests the definition of a sequence of functions y_n(x) by

$$y_0(x) = b, \qquad y_{n+1}(x) = b + \int_a^x f(t, y_n(t))\,dt.$$

Under mild assumptions on f(x,y), it can be proved that the sequence
{y_n(x)} has a limit which is the unique solution of our problem. Since
we have quadrature methods at our disposal, it would be possible to
evaluate each y_n in turn until a satisfactory approximate solution is
obtained.

A little experience shows that this method is not a very practical one,
except in the neighborhood of a. We find that we have essentially a
two-dimensional solution to a one-dimensional problem; we have to
tabulate each y_n(x). It will be seen that modifications of our quadrature
process enable us to traverse the interval (a,x) once only.

Consider the solution of y′ = −2xy, y(0) = 1, in particular the
determination to 4D of y(.5) and y(1). [The actual solution is y = exp(−x²).]
Instead of carrying out the quadratures numerically, we can do them
analytically in this case and obtain y_0 = 1, y_1 = 1 − x²,
y_2 = 1 − x² + ½x⁴, y_3 = 1 − x² + ½x⁴ − ⅙x⁶, ..., the initial segments of the
power series for y. It is clear then that, even if our quadratures were
exact, it would take about four quadratures to get y(.5) and about eight
to get y(1) to the required accuracy.
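
The iterates of this example can be generated mechanically. The Python sketch below is a hypothetical illustration (the names `picard_step` and `evaluate` are ours): each iterate is stored as a list of exact rational polynomial coefficients, and the quadrature is carried out analytically, term by term.

```python
from fractions import Fraction
from math import exp

def picard_step(coeffs):
    """One Picard iteration for y' = -2xy, y(0) = 1.

    coeffs[k] is the coefficient of x^k in y_n. The integrand -2t*y_n(t)
    has coefficient -2*c_k on t^(k+1); integrating from 0 to x gives
    -2*c_k/(k+2) on x^(k+2), and the constant term is y(0) = 1.
    """
    return [Fraction(1), Fraction(0)] + [
        Fraction(-2) * c / (k + 2) for k, c in enumerate(coeffs)
    ]

def evaluate(coeffs, x):
    return float(sum(c * Fraction(x) ** k for k, c in enumerate(coeffs)))

iterates = [[Fraction(1)]]                 # y_0 = 1
for _ in range(8):
    iterates.append(picard_step(iterates[-1]))

for n in (3, 4, 7, 8):
    print(n, evaluate(iterates[n], 0.5) - exp(-0.25),
             evaluate(iterates[n], 1.0) - exp(-1.0))
```

The printed differences show how many iterations are needed at x = .5 and at x = 1 before the error falls below half a unit in the fourth decimal.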


2.26 Local Taylor Series 


Whenever the differential equation is such that it is easy to obtain the
nth derivative of the solution (for example, by a recurrence relation), the
following method is very suitable. It has checks at every stage, and
large intervals can be used. We discuss two examples.


(a) y′ = y − x², y(0) = 3.

The solution is y = eˣ + x² + 2x + 2. Here

$$y' = y - x^2, \qquad y'' = y - x^2 - 2x, \qquad y^{(3)} = y^{(4)} = \cdots = y - x^2 - 2x - 2.$$

We use an interval h = .3 and compute to 5D:

                        x = 0      x = .3     x = .6

y                      3.00000    4.03986    5.38212
hy′                     .90000    1.18496    1.50664
h²y″/2!                 .13500     .15074     .17200
h³y⁽³⁾/3!               .00450     .00607     .00820
h⁴y⁽⁴⁾/4!               .00034     .00046     .00061
h⁵y⁽⁵⁾/5!               .00002     .00003     .00004
Σ hⁿy⁽ⁿ⁾/n!            4.03986    5.38212    7.06961
Σ (−h)ⁿy⁽ⁿ⁾/n!         2.23082    3.00000    4.03985


Adding the first six entries in the first column we get y(.3); if we add with
alternate signs, we get y(−.3). From the computed value of y(.3), we
obtain the first six entries in the second column, using the expressions for
y⁽ⁿ⁾. We then obtain y(.6) and y(0); the latter checks with the initial
value. From y(.6) we build up the third column, and so on.

If we attempt to use a larger interval, say h = 1.5, we find we need to
take 11 terms instead of 6.
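
The columns of the table in example (a) can be reproduced by a few lines of Python. This is a sketch under our own naming (`taylor_terms`, `step`), working in double precision rather than to 5D; the alternating sum supplies the backward check described above.

```python
def taylor_terms(x, y, h, count=6):
    """Terms h^n y^(n) / n!, n = 0..count-1, for y' = y - x^2."""
    d = [y, y - x * x, y - x * x - 2 * x]           # y, y', y''
    d += [y - x * x - 2 * x - 2] * (count - 3)      # y''' = y'''' = ...
    terms, factor = [], 1.0
    for n, dn in enumerate(d):
        terms.append(factor * dn)
        factor *= h / (n + 1)                       # h^(n+1) / (n+1)!
    return terms

def step(x, y, h):
    """Return (y(x + h), y(x - h)) from the local Taylor series at x."""
    t = taylor_terms(x, y, h)
    forward = sum(t)
    backward = sum((-1) ** n * tn for n, tn in enumerate(t))
    return forward, backward

h = 0.3
ys = [3.0]            # y(0), y(.3), y(.6), ...
checks = []           # backward values: y(-.3), y(0), ...
for i in range(2):
    fwd, back = step(i * h, ys[-1], h)
    ys.append(fwd)
    checks.append(back)

print([round(v, 5) for v in ys], [round(v, 5) for v in checks])
```

The second entry of `checks` is the value of y(0) recovered from x = .3, which should agree with the initial value 3.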

(b) (x² − 1)y″ + 2xy′ − 6y = 0,  y(6) = .00063 2330,

y′(6) = −.00032 1299.

The solution to this is y = Q₂(x) = (¾x² − ¼) log[(x + 1)/(x − 1)] −
(3/2)x. Writing τ_n = hⁿy⁽ⁿ⁾/n!, we find

$$(x^2 - 1)\,\tau_{n+2} + \frac{2hx(n+1)}{n+2}\,\tau_{n+1} + \frac{h^2(n-2)(n+3)}{(n+1)(n+2)}\,\tau_n = 0.$$

We can use h = .1 and integrate from x = 6 to x = 7 and compare the
results obtained with the correct solution:


y(7) = .00039 5644,  y′(7) = −.00017 1573.
In this case, in addition to computing Σ(±1)ⁿτ_n to predict and to correct
y, we have to compute Σ n(±1)ⁿτ_n to predict and correct y′, for our
recurrence relations are three-term ones.


2.27 Predictor-Corrector Methods


We describe a rather primitive method of the predictor-corrector type,
which shows the idea; for more elaborate methods, see, for example,
Milne [45].

If we write $I(n) = \int_0^{nh} f(t)\,dt$ and f(n) = f(nh), two earlier results can
be stated in the following form:


Milne:    I(4) − I(0) = 4h[2f(1) − f(2) + 2f(3)]/3,
          with an error of 14M₄h⁵/45,


Simpson:  I(4) − I(0) = 2h[f(0) + 4f(2) + f(4)]/3,
          with an error of −16M₄′h⁵/45,


where M₄, M₄′ are values of f⁽⁴⁾(x) at some intermediate points.
Consider the differential equation

y′(x) = f(x,y),  y(0) = I(0)

and suppose that we have obtained (adequate approximations to)
f(1) = f(h, y(h)), f(2), f(3). Then, using the first formula, we can
predict I(4) = y(4). Using this, we can compute f(4) and then use the
second formula to obtain another estimate for y(4). If the two values of
y(4) do not disagree too violently, we can accept their mean as a final
value (for the errors are almost equal and opposite) and proceed. In
the event of violent disagreement, a change to a smaller h is indicated.

We discuss two examples. 

(a) Solve y′ = x + y, y(0) = 1 for 0 ≤ x ≤ 1, to 4D. The correct
solution is y = 2eˣ − 1 − x, so that y(1) = 2e − 2 = 3.43656.... Using
an interval h = .1 and assuming that y(.1), y(.2), y(.3) are known, our
computation begins as follows:


  x     y (given or predicted)   y (corrected)   y′ = f(x,y) = x + y

  0           1.0000                                  1.0000
 .1           1.1103                                  1.2103
 .2           1.2428                                  1.4428
 .3           1.3997                                  1.6997
 .4           1.5836                1.5836            1.9836
 .5           1.7974                1.7974            2.2974
 .6           2.0442                                  2.6442
 .7
 .8
 .9
1.0

(b) Solve the equation y′ = x² − y², y(0) = 1, for 0 ≤ x ≤ 1. It can
be shown that

$$y = x\,\frac{\Gamma(\tfrac14)\,I_{-3/4}(\tfrac12x^2) + 2\Gamma(\tfrac34)\,I_{3/4}(\tfrac12x^2)}{\Gamma(\tfrac14)\,I_{1/4}(\tfrac12x^2) + 2\Gamma(\tfrac34)\,I_{-1/4}(\tfrac12x^2)},$$

and using tables of the fractional Bessel functions (NBSCUP 10, 11), we
find
y(1) = .75001 5703.

Instead of using the corrector noted above, we could use
I(4) − I(2) = h[f(2) + 4f(3) + f(4)]/3 with an error of −M₄″h⁵/90.
We can now observe that, if the difference between the predictor and the
corrector does not exceed 15 units (in the last place, e.g.), then the error
in the corrector may be estimated as about half a unit.
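
Example (a) above can be carried through mechanically. The Python sketch below is an illustration only: it takes the three starting values from the known solution, uses the Milne predictor and the Simpson corrector over the double interval exactly as stated at the head of this section, and accepts the mean of the two values at each step. The variable names are ours.

```python
from math import exp

def f(x, y):
    return x + y

def exact(x):
    return 2 * exp(x) - 1 - x

h = 0.1
y = [exact(i * h) for i in range(4)]      # y(0), y(.1), y(.2), y(.3) assumed known
d = [f(i * h, y[i]) for i in range(4)]    # the corresponding derivatives
gaps = []                                 # predictor minus corrector, step by step

for n in range(4, 11):
    pred = y[n - 4] + 4 * h * (2 * d[n - 3] - d[n - 2] + 2 * d[n - 1]) / 3
    corr = y[n - 4] + 2 * h * (d[n - 4] + 4 * d[n - 2] + f(n * h, pred)) / 3
    gaps.append(pred - corr)
    y.append(0.5 * (pred + corr))         # accept the mean
    d.append(f(n * h, y[n]))

print(round(y[4], 4), round(y[10], 4), max(abs(g) for g in gaps))
```

The list `gaps` plays the part of the comparison column of the desk computation: a large entry would signal that a smaller h is needed.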

These methods can be applied to vectorial equations. For simplicity,
consider the system

ẋ = yf₁(t),  ẏ = xf₂(t),  x(0), y(0) given,

where dots indicate differentiation with respect to t. We assume that
x, ẋ, y, ẏ are known for t = 0, 1, 2, 3. Then we proceed as follows:

1. Predict x(4) by Milne.
2. Compute ẏ(4) from ẏ = xf₂(t).
3. Predict y(4) by Milne.
4. Check y(4) by Simpson.
5. Compute ẋ(4) from ẋ = yf₁(t).
6. Check x(4) by Simpson.


2.28 Euler, Heun, and Runge-Kutta Methods 
One of the most naive approaches to the numerical solution of
y′ = f(x,y), y(x₀) = y₀ is associated with Euler. It consists in defining
the solution by the relation

$$y(n+1) = y(n) + hf(nh, y(n)), \qquad n \ge 0; \quad y(0) = y_0. \tag{2.18}$$

If we assume that y can be expanded as a power series in h, it is clear that
the local error is O(h²). This means that we might expect a total error
of about O(h), for over a fixed range we would have O(h⁻¹) steps, and if
we assume the errors additive, we have O(h⁻¹)O(h²) = O(h). This is
therefore not a very practical method; we note, however, that it requires
no special starting devices.

A slightly more complicated method is due to Heun; it has a smaller
error but retains the advantages of requiring no special starting devices.
It consists of the following scheme, for n ≥ 0:

$$\begin{aligned}
y^{*}(n+1) &= y(n) + hf(nh, y(n)), \\
y^{**}(n+1) &= y(n) + hf((n+1)h, y^{*}(n+1)), \\
y(n+1) &= \tfrac12\,(y^{*}(n+1) + y^{**}(n+1)).
\end{aligned} \tag{2.19}$$

Again, assuming the existence of a power series expansion for y, we find
that the local error is O(h³); more specifically, it does not exceed

$$\tfrac{1}{12}h^3\,|y'''(\xi)|, \qquad nh \le \xi \le (n+1)h,$$


if we assume that f(x,y) is independent of y, that is, that we are carrying
out an indefinite integration. If we allow for the dependence on y, the
above bound must be replaced by

$$\tfrac{1}{12}h^3\,|y'''(\xi)| + \tfrac14h^3\left|\,y''(\xi)\left(\frac{\partial f}{\partial y}\right)_{(\xi,\eta)}\right|.$$

This means that the total error, in favorable cases, is O(h²), which makes
the scheme a feasible one.
We note that this scheme is likely to be more efficient than a
straightforward generalization of the Euler process. For if we used

$$y(n+1) = y(n) + hf(nh,y(n)) + \tfrac12h^2\,[\,f_x(nh,y(n)) + f_y(nh,y(n))\,f(nh,y(n))\,],$$

we should have to evaluate f, f_x, f_y at each step instead of f twice if we
used (2.19).
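
The orders of the total errors of (2.18) and (2.19) can be checked experimentally. The Python sketch below is illustrative only; the test problem y′ = x + y, y(0) = 1 is borrowed from Sec. 2.27, and the function names are ours. Halving h should roughly halve the Euler error and quarter the Heun error.

```python
from math import exp

def euler_step(f, x, y, h):
    return y + h * f(x, y)                      # (2.18)

def heun_step(f, x, y, h):
    y_star = y + h * f(x, y)                    # y*
    y_2star = y + h * f(x + h, y_star)          # y**
    return 0.5 * (y_star + y_2star)             # (2.19)

def integrate(step, f, x0, y0, h, n):
    y = y0
    for i in range(n):
        y = step(f, x0 + i * h, y, h)
    return y

def f(x, y):
    return x + y

target = 2 * exp(1.0) - 2                       # exact y(1)
errors = {}
for name, step in (("euler", euler_step), ("heun", heun_step)):
    errors[name] = [abs(integrate(step, f, 0.0, 1.0, h, round(1 / h)) - target)
                    for h in (0.1, 0.05)]

euler_ratio = errors["euler"][0] / errors["euler"][1]
heun_ratio = errors["heun"][0] / errors["heun"][1]
print(euler_ratio, heun_ratio)
```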

The above error estimates do not hold whenever the solution cannot
be expanded as a power series. If we take a trivial case

$$y' = x^{1/2},$$

we see that the local error is O(h^{3/2}), not O(h³). It is instructive to study
the following classical example in the range [0,1],

$$y' = \sqrt{x} + \sqrt{y}, \qquad y(0) = 0, \tag{2.20}$$

which has a solution of the form

$$y = \tfrac23x^{3/2} + \tfrac47\left(\tfrac23\right)^{1/2}x^{7/4} + \tfrac17x^2 + \tfrac{1}{49}\left(\tfrac23\right)^{1/2}x^{9/4} - \tfrac{2}{1715}x^{5/2} + \cdots.$$


For a recent discussion, see Richter [46]. In order to make use of auto- 
matic computers which recognize numbers x, |x| < 1 only, we change 
the scale in each variable in (2.20) by a factor 4, and we shall discuss, 
instead of (2.20), 


y =Kve+hvy,  3(0) =0 


in the range 0 ≤ x ≤ .25. We consider the use of Heun’s method, with
a variable h. The results of this integration, for 4000h = 1, 5, 10, 50,
have been recorded in Sec. 4.22.

Should the Heun method be unsatisfactory, it is natural to try to get
a similar method with a smaller local error. One such method is the
Runge-Kutta method, which consists in writing

$$y(n+1) = y(n) + \tfrac16\,[K(1,n) + 2K(2,n) + 2K(3,n) + K(4,n)],$$

where

$$\begin{aligned}
K(1,n) &= hf[nh,\; y(n)], \\
K(2,n) &= hf[(n + \tfrac12)h,\; y(n) + \tfrac12K(1,n)], \\
K(3,n) &= hf[(n + \tfrac12)h,\; y(n) + \tfrac12K(2,n)], \\
K(4,n) &= hf[(n + 1)h,\; y(n) + K(3,n)].
\end{aligned}$$

This method is discussed in detail in Chap. 9 (see also Gill [48] and
Martin [47]). For the moment, we observe that, when f is independent
of y, it specializes to the use of the Simpson rule for the quadrature, that
the local error is O(h⁵), and that four evaluations of f(x,y) are required in
each step. Thus the error here is of the same order as that of the
predictor-corrector methods discussed in Sec. 2.27. We can give here a
rough comparison between the methods.

First of all, it can be shown that in many representative cases the h
appropriate for the Runge-Kutta method is much greater than twice
that appropriate for the Heun method, so that the Runge-Kutta method
is more efficient, although it requires about twice as much calculation
per step. To compare the Runge-Kutta method and the Simpson-Milne
method, we note that the local errors in the first can be estimated
as h⁵y⁽⁵⁾(ξ)/2880 (when f does not depend on y), which is to be compared
with −h⁵y⁽⁵⁾(ξ)/90 in the second method. The factor of 32 by which
they differ implies that the interval h in the first case can be about twice
that in the second. This means that the total amount of calculation
required is about the same.

Summarizing, we may say that, although the Runge-Kutta method
has no checks, it is probably to be preferred for automatic calculators,
whereas the Simpson-Milne method is to be preferred for desk computers
because of its checks, despite the difficulties at the beginning (which
reappear if the interval has to be decreased in the course of the calculation).
However, if many desk calculations have to be done, it will be
worthwhile investigating some of the more powerful finite-difference methods,
such as those discussed in IAT.

We note that it is possible to try the h² extrapolation discussed in
Sec. 2.33. We can carry out a Heun integration at interval h, then one
at interval ½h, say, and then improve each ordinate by eliminating the
h² component in the error.
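
A single Runge-Kutta step of the kind defined in this section takes only a few lines. The Python sketch below is illustrative (the name `rk4_step` is ours). It checks two of the remarks made above: that the step reduces to the Simpson rule when f does not depend on y, and that halving h reduces the total error by a factor near 16 on the test problem y′ = x + y, y(0) = 1 of Sec. 2.27.

```python
from math import cos, exp

def rk4_step(f, x, y, h):
    k1 = h * f(x, y)
    k2 = h * f(x + 0.5 * h, y + 0.5 * k1)
    k3 = h * f(x + 0.5 * h, y + 0.5 * k2)
    k4 = h * f(x + h, y + k3)
    return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6

def integrate(f, x0, y0, h, n):
    y = y0
    for i in range(n):
        y = rk4_step(f, x0 + i * h, y, h)
    return y

# f independent of y: one step is the Simpson rule on [0, h].
h = 0.2
one_step = rk4_step(lambda x, y: cos(x), 0.0, 0.0, h)
simpson = h * (cos(0.0) + 4 * cos(0.5 * h) + cos(h)) / 6

# Total error at x = 1 for y' = x + y, y(0) = 1, at two intervals.
target = 2 * exp(1.0) - 2
rk_errors = [abs(integrate(lambda x, y: x + y, 0.0, 1.0, step, round(1 / step)) - target)
             for step in (0.1, 0.05)]
rk_ratio = rk_errors[0] / rk_errors[1]
print(one_step - simpson, rk_errors, rk_ratio)
```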


2.29 Boundary-value Problems 


So far we have discussed initial-value problems only. There are,
however, various boundary problems for ordinary differential equations
which are of practical importance. Many of these are discussed in detail
in the texts of Collatz and Fox. We mention briefly two problems.
(a) Find a value of λ such that there is a nontrivial solution of the
equation

y″ + λxy = 0,  y(0) = 0 = y(1).

One method of trial and error for this is to guess a value of λ, say λ₁,
and solve the resulting initial-value problem, obtaining a value f₁ at
x = 1. We need to choose some initial slope, which we may take to be

y′(0) = 1.

A reasonable guess for λ₁ can be obtained in the following way.
Discretize the problem at interval h = ¼ to get

$$\begin{aligned}
y(0) - 2y(\tfrac14) + y(\tfrac12) &= -\tfrac{1}{16}\cdot\tfrac14\,y(\tfrac14)\,\lambda, \\
y(\tfrac14) - 2y(\tfrac12) + y(\tfrac34) &= -\tfrac{1}{16}\cdot\tfrac12\,y(\tfrac12)\,\lambda, \\
y(\tfrac12) - 2y(\tfrac34) + y(1) &= -\tfrac{1}{16}\cdot\tfrac34\,y(\tfrac34)\,\lambda,
\end{aligned}$$

where we put y(0) = y(1) = 0. This gives a cubic for λ (see Collatz
[49a], p. 373; [49c], p. 299).

Having obtained f₁ from λ₁, we take another λ₂ and compute the
corresponding f₂. Using these two values, we interpolate and get a new
approximation λ₃ for λ. We then obtain f₃ and carry on in this manner.
The exact solution to our problem can be obtained in the following
way. The solution to the problem satisfying the left-hand boundary
condition only is

$$y = x^{1/2}J_{1/3}\!\left(\tfrac23\lambda^{1/2}x^{3/2}\right).$$

If this is to vanish for x = 1, we must have (NBSCUP 10, p. 385)

⅔√λ = 2.9025..., 6.0327..., ....

The least value of λ is therefore

λ = (4.3538)² = 18.9563 ....
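
The trial-and-error process for problem (a) can be sketched in Python. The assumptions here are ours: the initial-value problem is integrated by a Runge-Kutta step with 200 intervals, the two starting guesses 17 and 20 are chosen arbitrarily on either side of the expected value, and the interpolation between successive guesses is linear (the secant rule).

```python
def y_at_one(lam, steps=200):
    """Integrate y'' = -lam*x*y, y(0) = 0, y'(0) = 1 and return y(1)."""
    h = 1.0 / steps
    x, y, v = 0.0, 0.0, 1.0

    def acc(x, y):
        return -lam * x * y

    for _ in range(steps):
        k1y, k1v = v, acc(x, y)
        k2y, k2v = v + 0.5 * h * k1v, acc(x + 0.5 * h, y + 0.5 * h * k1y)
        k3y, k3v = v + 0.5 * h * k2v, acc(x + 0.5 * h, y + 0.5 * h * k2y)
        k4y, k4v = v + h * k3v, acc(x + h, y + h * k3y)
        y += h * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
        x += h
    return y

def shoot(l1, l2, tol=1e-12, max_iter=50):
    """Linear interpolation on successive pairs (lambda, f) until f is negligible."""
    f1, f2 = y_at_one(l1), y_at_one(l2)
    for _ in range(max_iter):
        if abs(f2) < tol or f2 == f1:
            break
        l1, l2, f1 = l2, l2 - f2 * (l2 - l1) / (f2 - f1), f2
        f2 = y_at_one(l2)
    return l2

lam = shoot(17.0, 20.0)
print(lam)
```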


(b) Find the least value of λ for which there is a nontrivial solution of
the relations

y″ + λy = 0,  y(0) = 0 = y(π).

It can be shown that, if u(x) is a function satisfying the boundary
conditions, then the required value of λ is the minimum value of the
Rayleigh quotient,

$$R(u) = \int_0^{\pi} u'^2\,dx \bigg/ \int_0^{\pi} u^2\,dx,$$

and the function u which minimizes R(u) is the required solution. Thus,
if
u₁ = sin x,  we get R(u₁) = 1;
if we take
u₂ = x(x − π),  we get R(u₂) = 10/π² > 1;


and if we take
u₃ = x(x + π)(x − π),  we get R(u₃) = 21/2π² > 1.


The development of this idea into a practical method requires
considerable experience.
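
The three values of the Rayleigh quotient quoted above are easily confirmed numerically. The Python sketch below is an illustration only; the composite Simpson rule with 2000 intervals and the function names are our own choices.

```python
from math import cos, pi, sin

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with n (even) intervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(f(a + 2 * i * h) for i in range(1, n // 2))
    return s * h / 3

def rayleigh(u, du):
    """R(u) = integral of u'^2 over integral of u^2 on (0, pi)."""
    return (simpson(lambda x: du(x) ** 2, 0.0, pi)
            / simpson(lambda x: u(x) ** 2, 0.0, pi))

r1 = rayleigh(sin, cos)
r2 = rayleigh(lambda x: x * (x - pi), lambda x: 2 * x - pi)
r3 = rayleigh(lambda x: x * (x + pi) * (x - pi), lambda x: 3 * x * x - pi * pi)
print(r1, r2, r3)
```

Both polynomial trial functions give values above 1, as they must, the parabola u₂ lying much closer to the minimum than the cubic u₃.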


2.30 Stability of Numerical Processes for the Solution of 
Ordinary Differential Equations 

The stability of numerical solutions for differential equations is a
rather delicate subject and will be treated in more detail in Chap. 9. It
is, however, important that the possibilities of catastrophes should always
be borne in mind, and it is our purpose to illustrate this in a simple case.
It is convenient to interpolate a brief account of the solution of ordinary
difference equations with constant coefficients.

There is a vast theory of difference equations, and some of it is parallel 
to the theory of ordinary differential equations. In particular, the 
familiar theory of ordinary differential equations with constant co- 
efficients carries over in a natural way. 

The solution to the first-order difference equation

$$u_{n+1} = ku_n, \qquad u_0 \text{ given},$$

is manifestly

$$u_n = u_0k^n.$$

A second-order difference equation can be written, without loss of
generality, in the form

$$u_{n+2} - (\alpha + \beta)u_{n+1} + \alpha\beta u_n = 0,$$

which can be rearranged as

$$(u_{n+2} - \alpha u_{n+1}) - \beta(u_{n+1} - \alpha u_n) = 0 \tag{2.21}$$

or

$$(u_{n+2} - \beta u_{n+1}) - \alpha(u_{n+1} - \beta u_n) = 0.$$

We must now distinguish between (a) the case in which α ≠ β and (b) the
case in which α = β. In case (a) we can regard (2.21) as a first-order
equation for v_n = u_{n+1} − αu_n, so that we find

$$u_{n+1} - \alpha u_n = v_n = v_0\beta^n. \tag{2.22}$$

Similarly, we can regard (2.21) as a first-order equation for w_n = u_{n+1} − βu_n
and obtain

$$u_{n+1} - \beta u_n = w_0\alpha^n. \tag{2.23}$$

Eliminating u_{n+1} from (2.22) and (2.23), we find

$$u_n = A\alpha^n + B\beta^n.$$

This method breaks down in case (b). We now obtain only one
equation in place of the pair (2.22) and (2.23):

$$u_{n+1} - \alpha u_n = v_0\alpha^n. \tag{2.24}$$

If we now introduce w_n = u_nα⁻ⁿ, this becomes

$$w_{n+1} - w_n = v_0\alpha^{-1}. \tag{2.25}$$

Summing these last equations, we find

$$w_n - w_0 = nv_0\alpha^{-1},$$

so that

w_n = A + Bn

and u_n = (A + Bn)αⁿ.
The extension of these results to equations of higher order is evident. 
The solution of an initial-value problem

$$y'' = -y, \qquad y(0) = 0, \quad y'(0) = 1 \tag{2.26}$$

is essentially reduced to the solution of a recurrence relation for y(n) =
y(nh). We can replace the continuous problem (2.26) by a discrete
one if we replace the second derivative by h⁻²[y(n + 2) − 2y(n + 1) +
y(n)]:

$$y(n+2) = (2 - h^2)\,y(n+1) - y(n), \qquad y(0) = 0, \tag{2.27}$$

where a suitable value of y(1) is to be assigned.
The solution of (2.27) for h = 1 with y(1) = 1 is

y(n) = 0, 1, 1, 0, −1, −1, 0, 1, 1, 0, ....

If we take h = .1 and y(1) = sin .1 = .09983 and work to 5D, we get
the results in column (2) of the table below, which are to be compared
with the 10D values of sin x in column (1). We are therefore getting
about 3D accuracy.
If we require more accurate results, it seems natural to take a better
approximation than the simple h²y″ = δ²y which led to (2.27). If we use

$$h^2y'' = (\delta^2 - \tfrac{1}{12}\delta^4)\,y, \tag{2.28}$$

we get, for h = .1:

$$y(n+4) = 16y(n+3) - 29.88y(n+2) + 16y(n+1) - y(n), \qquad y(0) = 0, \tag{2.29}$$

where suitable starting values of y(1), y(2), y(3) must be assigned.
In column (3) we have taken y(r) = sin r/10, r = 1, 2, 3 and worked
to 10D, and in column (4) we have again taken the correct starting
values but worked to 5D. These are the catastrophes at which we have
hinted and which we shall discuss presently.

Another method replaces the δ⁴ in (2.28) by its approximate value
−h²δ²; in this way we obtain a three-term recurrence,

$$(12 + h^2)\,\delta^2y = -12h^2y,$$

which, in the case h = .1, is

$$y(n+2) = 1.99000\ 83333\,y(n+1) - y(n).$$

The results obtained by the use of this method are recorded in column
(5) and are good to 6D.


  x        (1)           (2)          (3)            (4)          (5)

  0    .00000 00000    .00000    .00000 00000      .00000    .00000 00000
 .1    .09983 34166    .09983    .09983 34166      .09983    .09983 34166
 .2    .19866 93308    .19866    .19866 93308      .19867    .19866 93310
 .3    .29552 02067    .29550    .29552 02067      .29552    .29552 02077
 .4    .38941 83423    .38939    .38941 83685      .38934    .38941 83450
 .5    .47942 55386    .47939    .47942 59960      .47819    .47942 55440
 .6    .56464 24734    .56460    .56464 90616      .54721    .56464 24828
 .7    .64421 76872    .64416    .64430 99144      .40096    .64421 77021
 .8    .71735 60909    .71728    .71864 22373    −2.67357    .71735 61128
 .9    .78332 69096    .78323    .80125 45441                .78332 69403
1.0    .84147 09848    .84135   1.09135 22239                .84147 10261
1.1    .89120 73601    .89106   4.37411 56871                .89120 74139
1.2    .93203 90860    .93186                                .93203 91543
1.3    .96355 81854    .96334                                .96355 82701
1.4    .98544 97300    .98519                                .98544 98328
1.5    .99749 49866    .99719                                .99749 51092
1.6    .99957 36030    .99922                                .99957 37469
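
The behavior recorded in the table above is easily reproduced. The Python sketch below is an illustration with names of our own (`run`, `alpha`): it iterates the five-term recurrence (2.29) with rounding to 5D, as in column (4), and in double precision, roughly as in column (3), and it iterates the three-term recurrence of column (5). It also computes the dominant root of the characteristic equation of (2.29), obtained from r + 1/r = 8 + √36.12.

```python
from math import sin, sqrt

def run(coeffs, start, n_last, digits=None):
    """Iterate y(m) = sum_j coeffs[j] * y(m - 1 - j), optionally rounding each value."""
    def keep(v):
        return round(v, digits) if digits is not None else v
    y = [keep(v) for v in start]
    while len(y) <= n_last:
        y.append(keep(sum(c * y[-1 - j] for j, c in enumerate(coeffs))))
    return y

start = [sin(0.1 * r) for r in range(4)]                 # correct starting values
five_term_5d = run([16, -29.88, 16, -1], start, 8, digits=5)
five_term = run([16, -29.88, 16, -1], start, 11)
three_term = run([1.9900083333, -1], start[:2], 16)

# r^4 - 16 r^3 + 29.88 r^2 - 16 r + 1 = 0 reduces to s^2 - 16 s + 27.88 = 0, s = r + 1/r.
s = 8 + sqrt(64 - 27.88)
alpha = (s + sqrt(s * s - 4)) / 2

print(five_term_5d[4:9])
print(five_term[11], three_term[16], alpha)
```

The computed root shows why a parasitic component, once present, is multiplied by about 14 at every step.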


A detailed examination of these results has been given by Todd [51].
At present we note only that the solution of the difference equation (2.29)
is of the form

$$y(n) = A\alpha^n + B\alpha^{-n} + C\cos n\theta + D\sin n\theta,$$

where, to four figures,

α = 13.94,  θ = .1000.


The required solution is that in which A = B = C = 0, D = 1.
However, it is clear that, if any component of the solution Aαⁿ enters, it will
swamp all the rest. This is indeed what has occurred in (3) and (4); in
one case, we get a component with an A > 0, and in the other, one with
an A < 0. These are brought in during the determination of y(4).


MISCELLANEOUS DEVICES


2.31 Introduction

Several interesting processes have been devised to facilitate
computations, for example, the Aitken δ² process mentioned in Chap. 1. These
were originally conceived for use with desk calculators, but as the
customers of the automatic computer became more demanding, such
processes have been taken over by the more powerful equipment, and the
methods have been developed further. In addition to the work
described here, there have been many investigations by Airey, Bickley,
Miller, and Wynn.


2.32 The Aitken δ² Process


The Aitken δ² process is described in Chap. 1. We note here that it
has been developed recently by Shanks, Lubkin, and Wynn.
The following example is suggested for study. Find the first few
convergents for √19 = 4.3588989⋯ from the (periodic) continued fraction

$$\sqrt{19} = 4 + \cfrac{1}{2 + \cfrac{1}{1 + \cfrac{1}{3 + \cfrac{1}{1 + \cfrac{1}{2 + \cfrac{1}{8 + \cdots}}}}}}$$

and apply the δ² method to them.
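
A start on the suggested study can be made as follows. This Python sketch is illustrative only: the partial quotients 4, 2, 1, 3, 1, 2, 8 are those of the first period of the continued fraction for √19, the function names are ours, and exact rational arithmetic is used so that the δ² values can be compared with later convergents.

```python
from fractions import Fraction

def convergents(partial_quotients):
    """Successive convergents p/q of a simple continued fraction."""
    p_prev, q_prev, p, q = 1, 0, partial_quotients[0], 1
    out = [Fraction(p, q)]
    for a in partial_quotients[1:]:
        p_prev, q_prev, p, q = p, q, a * p + p_prev, a * q + q_prev
        out.append(Fraction(p, q))
    return out

def aitken(seq):
    """Aitken's process: x0 - (x1 - x0)^2 / (x2 - 2*x1 + x0) for each triple."""
    out = []
    for x0, x1, x2 in zip(seq, seq[1:], seq[2:]):
        d1 = x1 - x0
        d2 = x2 - 2 * x1 + x0
        out.append(x0 - d1 * d1 / d2)
    return out

c = convergents([4, 2, 1, 3, 1, 2, 8])
a = aitken(c)
root = 19 ** 0.5
for n, value in enumerate(c):
    print("convergent", n, value, float(value) - root)
for n, value in enumerate(a):
    print("aitken    ", n, value, float(value) - root)
```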


2.33 Richardson’s h² Extrapolation, or Deferred Approach
to the Limit


The idea of Richardson’s h² extrapolation is somewhat similar to that
of the Aitken scheme. In general, let φ(x) be the solution to a continuous
problem and φ(x,h) be the solution to a discrete version of the problem,
where h indicates the mesh size. In some circumstances, we may have

$$\phi(x,h) = \phi(x) + h^2\phi_2(x) + R_h,$$

and if R_h is negligible and if we solve the discrete problem for two values
of h, say h₁ and h₂, we have

$$\begin{aligned}
\phi(x,h_1) &= \phi(x) + h_1^2\phi_2(x), \\
\phi(x,h_2) &= \phi(x) + h_2^2\phi_2(x).
\end{aligned}$$

From these we can eliminate φ₂ to get

$$\phi(x) = \frac{h_2^2\,\phi(x,h_1) - h_1^2\,\phi(x,h_2)}{h_2^2 - h_1^2}.$$

This plausible device can be applied in other contexts, for instance,
when φ is a constant or when φ is a function of several variables. We
discuss a simple example.
Consider the extrapolation for the circumference of a circle of radius
unity, from the lengths of the inscribed square and hexagon. We have
an estimate of 4√2 corresponding to a mesh length of ¼ and an estimate
of 6 corresponding to a mesh length of ⅙. This leads to

$$\frac{\tfrac{1}{36}(4\sqrt{2}) - \tfrac{1}{16}(6)}{2(\tfrac{1}{36} - \tfrac{1}{16})} = 3.1373$$

as an estimate for π.
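
The arithmetic of the circle example is quickly checked. In the Python sketch below the function name `richardson` is ours; it simply evaluates the elimination formula for two mesh sizes.

```python
from math import pi, sqrt

def richardson(phi1, h1, phi2, h2):
    """Eliminate the h^2 term from two discrete solutions phi(x, h1), phi(x, h2)."""
    return (h2 ** 2 * phi1 - h1 ** 2 * phi2) / (h2 ** 2 - h1 ** 2)

square = 4 * sqrt(2)        # perimeter of the inscribed square, mesh length 1/4
hexagon = 6.0               # perimeter of the inscribed hexagon, mesh length 1/6
estimate = richardson(square, 1 / 4, hexagon, 1 / 6) / 2
print(square / 2, hexagon / 2, estimate, pi)
```

The extrapolated value is closer to π than either half-perimeter.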
We note that this method can be applied in the case of the partial 
differential equation discussed in Chap. 4. It cannot be applied in the 
case of the ordinary differential equation discussed in Sec. 2.28 above. 


For a discussion of other cases when the method is inapplicable, see 
Wasow [52]. 


2.34 The Euler Transformation 


The Euler summation method is essentially a transformation of one
infinite series into another. It is most convenient to exhibit it in the
case of an alternating series. We proceed formally, using the
finite-difference operators:

$$\begin{aligned}
u_0 - u_1 + u_2 - \cdots &= (1 - E + E^2 - \cdots)\,u_0 = (1 + E)^{-1}u_0 \\
&= (2 + \Delta)^{-1}u_0 = \tfrac12(1 + \tfrac12\Delta)^{-1}u_0 \\
&= \tfrac12u_0 - \tfrac14\Delta u_0 + \tfrac18\Delta^2u_0 - \tfrac{1}{16}\Delta^3u_0 + \cdots.
\end{aligned}$$

We show the efficacy of this by considering the evaluation of ln 2 from
the power series: ln 2 = 1 − ½ + ⅓ − ⋯. It is possible to apply
the Euler transformation directly to this series, but it is more convenient
to apply it to the tail. We obtain

1 − ½ + ⋯ − ⅛ = .63452 3809,

and we difference the sequence 1/9, 1/10, 1/11, 1/12, ... thus:
.11111 1111
               −.01111 1111
.10000 0000                   +.00202 0202
               −.00909 0909                   −.00050 5051
.09090 9091                   +.00151 5151                   +.00015 5402
               −.00757 5758                   −.00034 9649                   −.00005 5505
.08333 3333


The next leading differences are +.00002 2212, −.00000 9740, ....
Using all of these, we find

$$\sum_{n=9}^{\infty}(-1)^{n-1}/n = .05555\ 5556 + .00277\ 7778 + .00025\ 2525 + .00003\ 1566 + .00000\ 4856 + .00000\ 0867 + .00000\ 0174 + .00000\ 0038,$$

giving
ln 2 = .69314 7169,


which we compare with the true value
ln 2 = .69314 7181.


There does not seem to have been any study of the optimal division of 
a series into a head and a tail. 

We remark that the transformed series always converges if the original 
one does, but not necessarily more rapidly. 
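
The computation of ln 2 above can be repeated in a few lines. The Python sketch below is an illustration only (the name `euler_sum` is ours); it sums the first eight terms directly and applies the Euler transformation, with eight leading differences, to the tail beginning at 1/9.

```python
from math import log

def euler_sum(u, terms):
    """Euler transform of u[0] - u[1] + u[2] - ..., using `terms` leading differences."""
    diffs = list(u[:terms])
    total = 0.0
    for k in range(terms):
        total += (-1) ** k * diffs[0] / 2 ** (k + 1)    # (-1)^k * Delta^k u_0 / 2^(k+1)
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
    return total

head = sum((-1) ** (n - 1) / n for n in range(1, 9))    # 1 - 1/2 + ... - 1/8
tail = euler_sum([1 / (9 + k) for k in range(8)], 8)    # 1/9 - 1/10 + ...
print(head, tail, head + tail, log(2))
```

Sixteen terms of the original series, summed directly, would give barely one correct decimal; the same sixteen terms used in this way give about eight.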


2.35 Asymptotic Expansions 


The concept of asymptotic series is used essentially in numerical
analysis. We shall discuss a rather primitive case in some detail. The
subject is discussed incidentally in many books, and there are several
monographs available (see, e.g., Erdélyi [53]).

Whenever the (numerical) behavior of a function over an infinite range,
say 0 ≤ x < ∞, is sought, a direct tabulation does not come in question.
One answer to this problem is to change the dependent variable, for
example, to x⁻¹ or x⁻² (see Sec. 2.9). Another is to obtain a simple
approximation to it, valid for large x, say x ≥ x₀, from which the behavior
is obvious and from which the function can be readily calculated to the
relevant precision; this analytical expression will replace the tabulation
in the infinite range x₀ ≤ x.

Consider

$$f(x) = \int_x^{\infty} t^{-1}e^{x-t}\,dt = -e^{x}\,\mathrm{Ei}(-x)$$

for x > 0, where the integral is manifestly convergent. By successive
integration by parts, we find

$$f(x) = \frac{1}{x} - \frac{1}{x^2} + \cdots + \frac{(-1)^{n-1}(n-1)!}{x^n} + (-1)^n n!\int_x^{\infty} t^{-n-1}e^{x-t}\,dt.$$

If S_n(x) is the sum of the first n terms on the right and r_n(x) is the last term,
we have

$$\begin{aligned}
|r_n(x)| = |f(x) - S_n(x)| &= n!\int_x^{\infty} t^{-n-1}e^{x-t}\,dt \\
&\le n!\,x^{-n-1}\int_x^{\infty} e^{x-t}\,dt \\
&= n!\,x^{-n-1}\left[-e^{x-t}\right]_x^{\infty} \\
&= n!\,x^{-n-1}.
\end{aligned}$$

We want to study this estimate for r_n(x). First, it is clear that, if we fix n,
we can make this remainder as small as we please if we take x sufficiently
large. Secondly, consider, more practically, the behavior of our
estimate as a function of n, with x being fixed. We have

$$n!\,x^{-n-1} = (nx^{-1})\,[(n-1)!\,x^{-n}],$$

so that our estimate for |r_n(x)| decreases as n increases from 1 to [x], the
integral part of x, and then increases to ∞. Thus, for a fixed value of x,
there is a limit to the (estimate for the) accuracy by which we can
approximate f(x) by taking a number of terms of the series

$$\sum_{n=0}^{\infty}(-1)^n n!\,x^{-n-1}. \tag{2.30}$$


This is to be carefully distinguished from what happens in the case of a
convergent series

$$\sum u_n(x) = S_n(x) + r_n(x),$$

where |r_n(x)| can be made arbitrarily small by taking n sufficiently large.
Indeed the series (2.30) is not convergent for any x, since its general term
does not tend to zero. The series has the property that

$$\lim_{x\to\infty} x^n[f(x) - S_n(x)] = 0, \qquad n = 1, 2, \ldots.$$

Such a series is said to be asymptotic in the sense of Poincaré, and the
relation is denoted by

$$f(x) \sim x^{-1} - x^{-2} + 2!\,x^{-3} - 3!\,x^{-4} + \cdots.$$

In general we write

$$F(x) \sim A_0 + A_1x^{-1} + A_2x^{-2} + \cdots, \qquad x \to \infty,$$

whenever

$$\lim_{x\to\infty} x^n[F(x) - (A_0 + A_1x^{-1} + \cdots + A_nx^{-n})] = 0, \qquad n = 0, 1, 2, \ldots.$$

When this is the case, it is clear that

$$\begin{aligned}
\lim F(x) &= A_0, \\
\lim x[F(x) - A_0] &= A_1, \\
\lim x^2[F(x) - A_0 - A_1x^{-1}] &= A_2.
\end{aligned}$$


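One of the most useful asymptotic formulas is that of Stirling, which is
properly written in the form

$$\log\Gamma(z) - (z - \tfrac12)\log z + z \sim \tfrac12\log 2\pi + \sum_{r=1}^{\infty}\frac{(-1)^{r-1}B_r}{2r(2r-1)z^{2r-1}},$$

where the B_r are the Bernoulli numbers (B₁ = 1/6, B₂ = 1/30, ...).
It is conventional to write this as

$$\log\Gamma(z) \sim (z - \tfrac12)\log z - z + \tfrac12\log 2\pi + \sum_{r=1}^{\infty}\frac{(-1)^{r-1}B_r}{2r(2r-1)z^{2r-1}}$$

or

$$\Gamma(z) \sim e^{-z}z^{z}\sqrt{\frac{2\pi}{z}}\left(1 + \frac{1}{12z} + \frac{1}{288z^2} + \cdots\right).$$

The latter is valid for complex z, provided

|arg z| < π − δ, for some 0 < δ < π.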


Asymptotic series can be manipulated fairly freely, but precautions are
necessary. One of the unexpected phenomena is that of nonuniqueness,
because, for instance,

$$e^{-x} \sim 0 + 0\cdot x^{-1} + 0\cdot x^{-2} + \cdots.$$


Fig. 2.5 Behavior of actual and estimated error.



One of the standard methods of deriving asymptotic series is that of 
integration by parts. For other methods we have to refer to the litera- 
ture. 

We note that in some cases, in particular the one discussed above, the 
error at any stage is less than the first omitted term. We note also that 
the optimal value of n obtained by consideration of the behavior of our 
estimate for |r,(x)| need not be the real optimal. This 1s made clear by 
Fig. 2.5. 

To summarize our discussion: it is clear that asymptotic series can be 
a useful device in the evaluation of functions provided they can give the 
required precision but, if this is not available, other methods must be 
used.

We shall now examine a numerical case, f(x), for x = 15. We record
on the left the multiplier n/x = n/15, which produces the (n + 1)st term
from the nth. In the center the actual terms are recorded and on the
right the partial sums.


 n    n/15     nth term          partial sum
 1   .0667   +.06666 66667     .06666 66667
 2   .1333   -.00444 44444     .06222 22223
 3   .2000   +.00059 25926     .06281 48149
 4   .2667   -.00011 85185     .06269 62964
 5   .3333   +.00003 16049     .06272 79013
 6   .4000   -.00001 05350     .06271 73663
 7   .4667   +.00000 42140     .06272 15803
 8   .5333   -.00000 19665     .06271 96138
 9   .6000   +.00000 10488     .06272 06626
10   .6667   -.00000 06293     .06272 00333
11   .7333   +.00000 04195     .06272 04528
12   .8000   -.00000 03077     .06272 01451
13   .8667   +.00000 02461     .06272 03912
14   .9333   -.00000 02133     .06272 01779
15  1.0000   +.00000 01991     .06272 03770
16  1.0667   -.00000 01991     .06272 01779
17  1.1333   +.00000 02124     .06272 03903
18           -.00000 02407     .06272 01496


Our arguments suggest that the best estimate for f(x) is that given on 
the fifteenth line. As a check on our arithmetic, we can compute the 
last term directly as 

\[ \frac{14!}{15^{15}} = \frac{8.718 \times 10^{10}}{4.3790 \times 10^{17}} = 1.9909\ldots \times 10^{-7}, \]

which checks the recorded one.
The value given in NBSAMS 51 is .062720 and that obtained from Coulson
and Duncanson [58] is
.00000 00191 8628... × 3269017.372... = .06272 028....
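
The tabulated computation can be retraced with a short Python sketch. It assumes, as the multiplier n/x suggests, that the terms are 1/x, -1!/x^2, 2!/x^3, ...; it generates them in exact rational arithmetic, locates the term of smallest magnitude, and sums up to that term. The function names are illustrative only.

```python
from fractions import Fraction

def asymptotic_terms(x, count):
    """Return the first `count` terms 1/x, -1!/x^2, 2!/x^3, ... as exact fractions."""
    terms = []
    term = Fraction(1, x)
    for n in range(1, count + 1):
        terms.append(term)
        # The multiplier n/x (with a change of sign) gives the (n+1)st term.
        term = -term * Fraction(n, x)
    return terms

def sum_to_smallest_term(x, count):
    """Sum the series up to and including the first term of smallest magnitude.

    Returns (index of that term counted from 1, partial sum, that term).
    """
    terms = asymptotic_terms(x, count)
    smallest = min(range(count), key=lambda i: abs(terms[i])) + 1
    partial = sum(terms[:smallest])
    return smallest, float(partial), float(terms[smallest - 1])

if __name__ == "__main__":
    index, partial, term = sum_to_smallest_term(15, 18)
    print(f"smallest term is number {index}: {term:.4e}")
    print(f"partial sum through that term: {partial:.10f}")
```

For x = 15 the fifteenth and sixteenth terms have equal magnitude, so the sketch stops at the fifteenth line, in agreement with the argument in the text.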


Experiments showed that the Euler process applied to asymptotic 
series gave sensible results. We find, for instance, that the sum of the 
first four terms in f(15) is 0.06269 62966 and that the Euler sum of the 
tail is .00002 39226, which gives 


f(15) = .06272 02190. 
If we sum the first eight terms before applying the Euler process, we find
f(15) = .06272 02790.
These processes were justified by Rosser in a paper which contains much
other valuable material [54].

For another worked example, see Goodwin and Staton [55]. 
2.36 Recurrence Relations 

The Bessel functions $J_n(x)$ satisfy the recurrence relation

\[ J_{n+1}(x) = 2n x^{-1} J_n(x) - J_{n-1}(x). \]

Let us consider the evaluation of $J_{20}(1)$ from this, starting from the
values of $J_0(1)$, $J_1(1)$. We find in succession


J_0 = .76519 76866
J_1 = .44005 05857
J_2 = .11490 34848
J_3 = .01956 33535
J_4 = .00247 66362
J_5 = .00024 97381
J_6 = +.00002 09504
J_7 = -.00000 16456


To go farther is pointless, for $J_7(1) = +.0000015$. The reason for this
loss of significance is the factor $2n x^{-1}$ (any error in $J_1(x)$ is multiplied by
2 in obtaining $J_2$, the error in $J_2$ is multiplied by 4 in obtaining $J_3$, and
so on) together with a cancellation effect.

It was observed by Miller (BAAS 10) that sensible results can be obtained
by using the recurrence relation backward, that is, in the form

\[ J_{n-1} = 2n J_n - J_{n+1}, \]

even though we begin with arbitrary values for $J_n$, $J_{n+1}$, for a large n.
We shall illustrate this.


If we choose, for example, “$J_{40}$” = 0, “$J_{39}$” = 1 × 10^{-8}, we obtain
successively


“$J_{40}$” = 0
“$J_{39}$” = .00000 001
“$J_{38}$” = .00000 078
“$J_{37}$” = .00005 927
“$J_{36}$” = .00438 520
“$J_{35}$” = .31567 513
 . . .
“$J_{20}$” = .43716 258 × 10^{26}.


This is, of course, not the correct result, because any multiple of the
Bessel functions satisfies the recurrence relation, since it is homogeneous.
To determine the appropriate scale factor, we can consider continuing
the recurrence down to obtain “$J_0$”. Comparison of this value with the
actual value of $J_0$ indicates what scale factor is appropriate. This gives

\[ J_{20} = 3.87350\,30 \times 10^{-25}. \]

Another way of determining the scale factor is to use the relation

\[ J_0(x) + 2 \sum_{m=1}^{\infty} J_{2m}(x) = 1. \]

This method is applicable in many circumstances and is convenient for
use on automatic computers, not only for the Bessel function $J_n(x)$ but
for various related functions and for such functions as the Coulomb wave
function (see Stegun and Abramowitz [56, 57] and Gautschi [60]).
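
The backward recurrence with normalization by the sum relation can be sketched in Python as follows. This is a minimal illustration in double-precision arithmetic, not the procedure of any particular table-maker; the function name `miller_bessel`, its arguments, and the starting value 1e-8 at order 39 (borrowed from the example above) are assumptions made for the sketch.

```python
def miller_bessel(x, target, start):
    """Approximate J_0(x), ..., J_target(x) by backward recurrence.

    The recurrence J_{n-1} = (2n/x) J_n - J_{n+1} is started from the
    arbitrary trial values "J"_{start+1} = 0 and "J"_{start} = 1e-8, and
    the trial values are then scaled so that J_0 + 2 (J_2 + J_4 + ...) = 1.
    """
    trial = [0.0] * (start + 2)
    trial[start] = 1.0e-8
    for n in range(start, 0, -1):
        trial[n - 1] = (2.0 * n / x) * trial[n] - trial[n + 1]
    # Scale factor from the (truncated) sum relation.
    scale = trial[0] + 2.0 * sum(trial[2::2])
    return [t / scale for t in trial[: target + 1]]

if __name__ == "__main__":
    values = miller_bessel(1.0, 20, 39)
    print(f"J_0(1)  = {values[0]:.10f}")
    print(f"J_1(1)  = {values[1]:.10f}")
    print(f"J_20(1) = {values[20]:.7e}")
```

Because the unwanted solution of the recurrence decays in the backward direction, the arbitrary starting values have a negligible effect on the low orders once the scale factor has been applied.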


MATHEMATICAL TABLES 
2.37 Introduction 


The purpose of this section is to give some advice on the choice and 
use of mathematical tables. Although it is clearly not possible to cover 
the detailed needs of all scientists—mathematicians, physicists, chemists, 
engineers, astronomers, and statisticians, for instance—it seems possible 
to give information which will satisfy most users of tables and to indicate 
sources from which they may obtain either more elaborate or less 
elaborate tables, along with appropriate instructions for their use. 

Our main concern is with mathematical tables in the strict sense. 
We have confined our references to reasonably accessible publications. 
We have also restricted ourselves, on the whole, to first-class tables. 
Among the qualities desirable in a table are: 

1. Reliability. Various standards of precision are admissible, but 
these should be stated and maintained. 

2. Legibility. This depends on the method of reproduction, the com- 
positor or designer, and the type available. 

3. Convenience. Here we are concerned with the arrangement and 
the ease of interpolation: information should be given about recom- 
mended methods and the degree of accuracy which they provide. 

It is convenient to distinguish among the books which should be im- 
mediately available (i.e., on one’s desk), those which should be available 
in a local library, and those which one should know exist and should 
know how to obtain on loan. These can roughly be classified as follows:

1. Collections of miscellaneous 3D to 6D tables, together with more 
extensive tables of special interest. Bibliographical material. 

2. Tables of medium accuracy, 6D to 8D. Bibliographical material.

3. Fundamental tables to 8D or more and tables of very special
functions.

Many computers find it convenient to use fundamental tables because 
the fine argument interval may make it possible to avoid interpolation. 


2.38 Short List 


In the first class we include the following: 


P. Barlow, “Tables of Squares, Cubes, ...,” L. J. Comrie, ed., London,
1941.
E. Jahnke, F. Emde, and F. Lösch, “Tafeln höherer Funktionen,” Stuttgart,
New York, 1960.
L. J. Comrie, “Chamber’s Six-figure Mathematical Tables,” vol. II,
New York-London, 1949.
or
“Chamber’s Shorter Six-figure Mathematical Tables,” New York-
London, 1950.
or
F. Lösch, “Siebenstellige Tafeln der elementaren transzendenten Funktionen,”
Berlin, 1954.


There is in preparation a comprehensive volume, 
“Handbook of Functions,” M. Abramowitz and I. A. Stegun, eds., 


which will appear in the National Bureau of Standards Applied Mathe- 
matics Series. 

Those who are at all concerned with statistics, either in the design of 
experiments or in the analysis of the results of experiments, will add to 
this list such collections as: 


R. A. Fisher and F. Yates, “Statistical Tables for Biological, Agricultural
and Medical Research,” London, 1943.

A. Hald, “Statistical Tables and Formulas,” New York, 1952.

E. S. Pearson and H. O. Hartley, “Biometrika Tables for Statisticians,”
vol. I, London, 1954.


2.39 Bibliographical Material 
Every organization which is at all concerned with computation should
have available: 


A. Fletcher, J. C. P. Miller, and L. Rosenhead, “An Index of Mathematical
Tables,” London-New York, 1946.


This volume, to which we shall refer as FMR, gives a comprehensive
bibliography of mathematical tables, up to 1944. A revised edition is in
preparation.

In 1956 there appeared a Russian index:


A. V. Lebedev and R. M. Federova, “Spravochnik po matematicheskim
tablitsiam.”


In 1959 a supplement to this was issued:
N. M. Burunova, “Dopolnenie,” no. 1.


These can be used even with a very small knowledge of Russian. 
The specialist in number theory will require: 


D. H. Lehmer, “Guide to Tables in the Theory of Numbers,” U.S.
National Research Council Bulletin 105, 1941.


“An Index of Tables for Statisticians,” J. A. Greenwood and H. O.
Hartley, eds., is in preparation. It will be divided into Part I, Statistical
Tables, and Part II, Selection of Mathematical Tables of Interest to
Statisticians.

Mathematical tables are covered by the standard reviewing journals:

Zentralblatt für Mathematik, 1931-

Mathematical Reviews, 1940- 

Referativnyi Zurnal, 1955— 

Computing Reviews, 1960- 
but above all in the U.S. National Research Council quarterly, 

Mathematical Tables and Other Aids to Computation, 1943-1959, 
which has been renamed 

Mathematics of Computation, 1960-. 

The last journal, 

Communications, Assoc. Comput. Mach., 1958-, 

and Numerische Mathematik, 1959-, 
include special accounts of essential equivalents to tables: (rational)
approximations to functions, and basic subroutines for automatic
computers. Another feature of Math. Comp. is monographs on special
classes of functions; for example:


H. Bateman and R. C. Archibald, A Guide to Tables of Bessel Functions, 
Math. Tables Aids Comput., vol. 1, pp. 205-308, 1943-1945. 

A. Fletcher, Guide to Tables of Elliptic Functions, Math. Tables Aids 
Comput., vol. 3, pp. 229-281, 1948. 


2.40 Notation 


A convenient notation for the contents of a table has been established.
By

f(x), x = a(h)b, 6D

we understand a table of f(x) to 6 places of decimals, for values of x at
intervals h, between a and b inclusive. Thus a table of sin x, x = 0(.02)1,
6D begins in one of the following forms:


  x      sin x     sin x    10^6 sin x
  0     .000000    000000     000000
 .02    .019999    019999     019999
 .04    .039989    039989     039989
 .06    .059964    059964     059964
 .08    .079915    079915     079915
 .10    .099833    099833     099833
 .12    .119712    119712     119712
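
To make the notation concrete, here is a small Python sketch that expands an argument specification x = a(h)b and tabulates a function to 6D. The helper names `arguments` and `table_6d` are hypothetical; decimal arithmetic is used for the arguments so that repeated addition of the interval h does not drift.

```python
import math
from decimal import Decimal

def arguments(a, h, b):
    """Yield the arguments a, a + h, ..., b of a table x = a(h)b as exact decimals."""
    x, step, last = Decimal(a), Decimal(h), Decimal(b)
    while x <= last:
        yield x
        x += step

def table_6d(func, a, h, b):
    """Return (argument, value) rows with the value rounded to 6 decimals (6D)."""
    return [(str(x), f"{func(float(x)):.6f}") for x in arguments(a, h, b)]

if __name__ == "__main__":
    # First rows of the table sin x, x = 0(.02)1, 6D.
    for x, value in table_6d(math.sin, "0", "0.02", "1")[:7]:
        print(f"{x:>5}  {value}")
```

The three printed forms shown above differ only in how the leading zero and the decimal point are displayed, not in the digits themselves.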


Occasionally, we find

f(x), x = a(h)b, 8S,


indicating that the function is given to 8 significant figures.
The following is the beginning of a table to 8S of Γ(n + ½) for
n = 1(1)1000.


n    Γ(n + ½)

0    1.77245 39   (0)†
1    8.86226 93   (-1)
2    1.32934 04   (0)
3    3.32335 10   (0)
4    1.16317 28   (1)
5    5.23427 78   (1)
6    2.87885 28   (2)


† The figures in parentheses indicate the power of 10 by which the given value
must be multiplied. Thus
Γ(3/2) = .88622693,    Γ(13/2) = 287.88528.


The meaning of extensions of this notation, such as

f(x), x = 0(.001)1(.01)2(.1)3(1)50(various)1000,

is apparent. In FMR a notation has been established which indicates
what provision is made for interpolation: whether proportional parts,
reduced derivatives, central differences, or modified differences.

Apart from tables in this conventional form, we call attention to the
existence of “critical” tables which are appropriate for the quantitative
description of slowly varying functions. The table of contents of a book
is essentially a critical table, with the page number as argument. We
give here an example of a critical table, to 3 decimal places, of the
Everett coefficient $E_2$.


   p      -E_2         p      -E_2

 .0000               .0137
          .000                .005
 .0015               .0169
          .001                .006
 .0045
          .002
 .0075
          .003
 .0106
          .004
 .0137


This indicates that for values of p satisfying $.0075 < p \le .0106$, we have
$-E_2(p) = .003$. The inequality on the right is usually explained by
the statement, “In critical cases, ascend.”
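
A critical table is looked up by locating the interval of critical arguments that contains p. The Python sketch below does this for the excerpt above with the standard `bisect` module. The closed form used for comparison, p(1 - p)(2 - p)/6, is an assumption about which Everett coefficient is tabulated (it reproduces the critical arguments shown), and all names are illustrative.

```python
from bisect import bisect_left

# Critical arguments and the values printed between them (excerpt above).
CRITICAL_P = [0.0000, 0.0015, 0.0045, 0.0075, 0.0106, 0.0137, 0.0169]
MINUS_E = [0.000, 0.001, 0.002, 0.003, 0.004, 0.005]

def minus_e_from_table(p):
    """Look up -E(p) in the critical table for 0 <= p <= .0169."""
    if not CRITICAL_P[0] <= p <= CRITICAL_P[-1]:
        raise ValueError("p lies outside the tabulated excerpt")
    # bisect_left counts the critical arguments strictly below p, so a p that
    # coincides with a critical argument takes the value printed above it
    # ("in critical cases, ascend").
    index = bisect_left(CRITICAL_P, p) - 1
    return MINUS_E[max(index, 0)]

def minus_e_direct(p):
    """Assumed closed form p(1 - p)(2 - p)/6, rounded to 3 decimals."""
    return round(p * (1.0 - p) * (2.0 - p) / 6.0, 3)

if __name__ == "__main__":
    for p in (0.001, 0.008, 0.0106, 0.012, 0.016):
        print(p, minus_e_from_table(p), minus_e_direct(p))
```

No interpolation is needed: the table delivers the correctly rounded value directly, which is why this form suits slowly varying functions.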

Standard round-off practice is illustrated by the following examples: 


.34;, <496,. 15, 229, 


which become, when rounded to one place, 
SO. 205. uey:- ae: 


In words, a contribution of less than half a unit in the last place retained
is ignored, a contribution greater than half a unit is counted as a unit, 
and a contribution of exactly half a unit is ignored or counted as a unit, 
according as the preceding digit is even or odd. In some tables special 
devices are introduced to give some indication of the amount of round-off. 
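
The round-off rule described in words above (an exact half goes to the even digit) is available in Python's standard `decimal` module as `ROUND_HALF_EVEN`. The sketch below applies it to a few illustrative values; the values and the helper name `round_off` are chosen for the sketch and are not taken from the text.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_off(value, places):
    """Round a decimal string to `places` decimals, exact halves going to the even digit."""
    quantum = Decimal(1).scaleb(-places)  # 10**(-places)
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)

if __name__ == "__main__":
    for text in ("0.34", "0.36", "0.15", "0.25"):
        print(text, "->", round_off(text, 1))
```

Decimal strings are used instead of binary floating-point numbers so that a value such as .15 is represented exactly and the tie rule is actually exercised.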

Italics are often used to indicate a change in suppressed leading figures 
in a table; an alternative notation is the use of an asterisk. 

In some tables in which linear interpolation is usually permissible, it is 
the custom to print first differences in italics whenever linear inter- 
polation is not permissible, according to the standards adopted. For 
instance: 


x f(x) 
7.5 .000553 

— 53 
7.6 500 

— 47 
7.7 453 

— 43 
7.8 410 

— 39 
7.9 371


2.41 Collections


There are several collections of tables, some prepared by organizations 
and others by individuals, which should be accessible. Most of these 
belong to the second class, though some are more appropriate in the third 
class. 


British Association for the Advancement of Science 


A Mathematical Tables Committee of the British Association for the 
Advancement of Science was active from 1873 until 1939, when its 
activities were transferred to the Royal Society Tables Committee. The 
publications of the first committee will be denoted by BAAS 1, .. . , and 
those of its successor by RS 1,.... A series of shorter mathematical 
tables is also being issued by the new committee; they will be denoted by 
RSS 1,.... All volumes are now published by the Cambridge Uni- 
versity Press, London. 


BAAS 1 “Circular and Hyperbolic Functions,” 1951.
BAAS 2 “Emden Functions,” D. H. Sadler and J. C. P. Miller, 1932.
BAAS 3 “Minimum Decompositions into Fifth Powers,” L. E.
Dickson, 1933.
BAAS 4 “Cycles of Reduced Ideals in Quadratic Fields,” E. L. Ince,
1934.
BAAS 5 “Factor Table, Giving the Complete Decompositions of All
Numbers Less than 100,000,” 1935.
BAAS 6 “Bessel Functions,” pt. 1: “Functions of Orders Zero and
Unity,” 1950.
BAAS 7 “The Probability Integral,” 1939.
BAAS 8 “Number-Divisor Tables,” 1940.
BAAS 9 “Tables of Powers Giving Integral Powers of Integers,” 1940.
BAAS 10 “Bessel Functions,” pt. 2: “Functions of Positive Integer
Order,” 1952.
BAAS 11 “Legendre Polynomials,” 1946.
BAAS 12 “Airy Integral,” J. C. P. Miller, 1946.


Royal Society 


RS 1 “Farey Series of Order 1025,” E. H. Neville, 1950.

RS 2 “Rectangular-Polar Conversion Tables,” E. H. Neville, 1956.

RS 3 “Binomial Coefficients,” 1956.

RS 4 “Tables of Partitions,” H. Gupta, C. E. Gwyther, and J. C. P.
Miller, 1958.

RS 5 “Representations of Primes by Quadratic Forms,” 1960.

RS 6 “Tables of the Riemann Zeta-function,” 1960.

RS 7 “Bessel Functions,” pt. 3: “Zeros and Associated Values,” 1960.

RS 8 “Mansell’s Tables of Logarithms,” in preparation.

RS 9 “Indices and Primitive Roots,” A. E. Western and J. C. P.
Miller, in press.


Royal Society Shorter Mathematical Tables 


RSS 1 “A Short Table for the Bessel Functions $I_{n+1/2}(x)$, $(2/\pi) K_{n+1/2}(x)$,”
C. W. Jones, 1952.

RSS 2 “Bessel Functions and Formulae,” 1953. (This is a reprint of
the introductory material in [BAAS 10].)

RSS 3 “A Short Table for Bessel Functions of Integer Orders and
Large Arguments,” L. Fox, 1954.


National Physical Laboratory 


A new series of shorter mathematical tables has been started by the
National Physical Laboratory. We denote these by NPL 1, ....


NPL 1 “The Use and Construction of Mathematical Tables,” L. Fox,
1956.

NPL 2 “Tables of Everett Interpolation Coefficients,” L. Fox, 1958.

NPL 3 “Tables of Generalized Exponential Integrals,” G. F. Miller,
1960.

NPL 4 “Tables of Weber Parabolic Cylinder Functions and Other
Functions for Large Arguments,” L. Fox, 1960.

NPL 5 “Chebyshev Series for Mathematical Functions,” C. W. Clenshaw,
in press.

NPL 6 “Tables for Bessel Functions of Moderate or Large Orders,”
F. W. J. Olver, in press.


Harvard University Computation Laboratory


This organization issues its tables in series of volumes called the Annals
of the Computation Laboratory of Harvard University, published since
1945 by the Harvard University Press, Cambridge, Mass. We denote
these by Harvard 1, ....

The main achievement to date is the series Harvard 3-14, published
from 1947 to 1951, a tabulation of the Bessel functions $J_n(x)$ for x between
0 and 100 for n = 0(1)135. The first two volumes are to 18 decimals
and the later volumes are to 10 decimals. In the earlier volumes the
interval is usually .001, and in the later it is usually .01. The series also
includes:


Harvard 2 “Tables of the Modified Hankel Functions of Order One-third
and of Their Derivatives,” 1945.

Harvard 17 “Table for the Design of Missiles,” 1948.

Harvard 18, 19 “Tables of the Generalized Sine- and Cosine-Integral
Functions,” pt. I, pt. II, 1949.

Harvard 20 “Tables of Inverse Hyperbolic Functions,” 1948.

Harvard 21 “Tables of the Generalized Exponential-Integral Functions,”
1949.

Harvard 22 “Tables of Sine φ/φ and Its First Eleven Derivatives,”
1949.

Harvard 23 “Tables of the Error Function and of Its First Twenty
Derivatives,” 1952.

Harvard 35 “Tables of the Cumulative Binomial Probability Distribution,”
1955.

Harvard 40 “Tables of the Function arcsin z,” 1956.


The series also includes two volumes of “Proceedings of Symposia
on Large-scale Digital Calculating Machines,” Harvard 16 and
Harvard 26, and four volumes concerned with the machinery at the
Harvard Computation Laboratory and the techniques used in its design
and operation, Harvard 1, 24, 25, 27. Harvard 1 contains an extensive
bibliography of numerical analysis.


Tracts for Computers 

A series of tracts for computers was instituted by K. Pearson in 1919
and is now being edited by E. S. Pearson. The volumes are denoted by
Tract 1, .... Many volumes in this series are of greatest interest to
statisticians:


Tract 1 “Tables of the Digamma and Trigamma Functions,” E.
Pairman, 1919.

Tract 2 “On the Construction of Tables and on Interpolation,”
pt. 1: “Univariate Tables,” K. Pearson, 1920.

Tract 3 “On the Construction of Tables and on Interpolation,”
pt. 2: “Bivariate Tables,” K. Pearson, 1920.

Tract 4 “Tables of the Logarithms of the Complete Γ-function,”
1921.

Tract 5 “Table of Coefficients of Everett’s Central Difference Interpolation
Formula,” A. J. Thompson, 1943.

Tract 6 “Smoothing,” E. C. Rhodes, 1921.

Tract 7 “The Numerical Evaluation of the Incomplete B-function,”
H. E. Soper, 1921.

Tract 8 “Table of the Logarithms of the Complete Γ-function,”
E. S. Pearson, 1922.

Tract 9 “Table of log Γ(x),” J. Brownlee, 1923.

Tract 10 “On Quadrature and Cubature,” J. O. Irwin, 1923.

Tracts 11, 14, 16-22 have been combined and issued as A. J. Thompson,
“Logarithmetica Britannica,” 2 vols., 1952.

Tract 12 “Bibliotheca tabularum mathematicarum,” pt. 1: “Logarithmic
Tables,” A: Logarithms of Numbers, J. Henderson,
1926.

Tract 13 “Tables of the Probable Error of the Coefficients of Correlation,”
K. J. Holzinger, 1925.

Tract 15 “Random Sampling Numbers,” L. H. C. Tippett, 1927.

Tract 23 “Tables of $\tan^{-1} x$ and $\log (1 + x^2)$,” L. J. Comrie, 1938.

Tract 24 “Tables of Random Sampling Numbers,” M. G. Kendall
and B. Babington Smith, 1946.

Tract 25 “Random Normal Deviates,” H. Wold, 1948.


Biometrika 


In this section we list various tables prepared under the Pearsonian 
influence and call attention to various tables (many of interest to statis- 
ticians) which have been published in Biometrika and of which copies are 
available. All are published by the Biometrika office.


“Tables of Incomplete Γ-function,” K. Pearson, 1934.

“Tables of the Incomplete Beta Function,” K. Pearson, 1934.

“Tables of the Complete and Incomplete Elliptic Integral,” A. M.
Legendre, 1934.


National Bureau of Standards 


The tables prepared directly or indirectly by the National Bureau of 
Standards have been issued in three series: (1) the Mathematical 
Tables Series, denoted by NBSMT;; (2) the Columbia University Press 
Series, denoted by NBSCUP; and (3) the Applied Mathematics Series, 
denoted by NBSAMS, in which revised editions of the first are being 
incorporated. 


Mathematical Tables Series: 


NBSMT 1 “Table of the First Ten Powers of the Integers from 1 to
1000,” 1939.

NBSMT 5 “Tables of Sine, Cosine and Exponential Integrals,” vol.
I, 1940.

NBSMT 6 “Tables of Sine, Cosine and Exponential Integrals,” vol.
II, 1940.

NBSMT 7 “Tables of Natural Logarithms,” vol. I, 1941.

NBSMT 9 “Tables of Natural Logarithms,” vol. II, 1941.

NBSMT 11 “Tables of the Moments of Inertia and Section Moduli of
Ordinary Angles, Channels, and Bulb Angles with
Certain Plate Combinations,” 1941.

NBSMT 17 “Miscellaneous Physical Tables: Planck’s Radiation
Functions, and Electronic Functions,” 1942.

NBSMT 18-37 is a series of smaller tables. Some of these have been
reissued (together with certain unpublished smaller
tables) in NBSAMS 37.


Columbia University Press Series:


NBSCUP 1 “Table of Reciprocals of the Integers from 100,000
through 200,009,” 1943.

NBSCUP 2 “Table of the Bessel Functions $J_0(z)$ and $J_1(z)$ for Complex
Arguments,” 1947.

NBSCUP 3 “Table of Circular and Hyperbolic Tangents and Cotangents
for Radian Arguments,” 1947.

NBSCUP 4 “Tables of Lagrangian Interpolation Coefficients,”
1948.

NBSCUP 5 “Tables of Arcsin x,” 1945.

NBSCUP 6 “Tables of Associated Legendre Functions,” 1945.

NBSCUP 7 “Tables of Fractional Powers,” 1946.

NBSCUP 8 “Tables of Spherical Bessel Functions,” vol. I, 1947.

NBSCUP 9 “Tables of Spherical Bessel Functions,” vol. II, 1947.

NBSCUP 10 “Tables of Bessel Functions of Fractional Order,” vol. I,
1948.

NBSCUP 11 “Tables of Bessel Functions of Fractional Order,” vol. II,
1949.

NBSCUP 12 “Table of the Bessel Functions $Y_0(z)$ and $Y_1(z)$ for
Complex Arguments,” 1950.

NBSCUP 13 “Tables Relating to Mathieu Functions,” 1951.


Applied Mathematics Series: 


NBSAMS 2 “Tables of Coefficients for Obtaining the First Derivative without Differences,” H. E. Salzer, 1948.

NBSAMS 3 “Table of the Confluent Hypergeometric Function $F(\tfrac12 n, \tfrac12; x)$ and Related Functions,” 1949.

NBSAMS 4 “Tables of Scattering Functions for Spherical Particles,” 1948.

NBSAMS 5 “Table of Sines and Cosines to 15 Decimal Places at Hundredths of a Degree,” 1948.

NBSAMS 6 “Tables of Binomial Probability Distribution,” 1950.

NBSAMS 7 “Tables to Facilitate Sequential t-tests,” 1951.

NBSAMS 8 “Tables of Powers of Complex Numbers,” H. E. Salzer, 1950.

NBSAMS 9 “Tables of Chebyshev Polynomials $S_n(x)$ and $C_n(x)$,” 1952.

NBSAMS 10 “Tables for Conversion of X-ray Diffraction Angles to Interplanar Spacing,” 1950.

NBSAMS 11 “Table of Arctangents of Rational Numbers,” J. Todd, 1951.

NBSAMS 12 “Monte Carlo Method,” A. S. Householder, G. E. Forsythe, and H. H. Germond, eds., 1951.

NBSAMS 13 “Tables for the Analysis of Beta Spectra,” 1951.

NBSAMS 14 “Tables of the Exponential Function $e^x$,” 1951.

NBSAMS 15 “Problems for the Numerical Analysis of the Future,” 1951.

NBSAMS 16 “Tables of n! and Γ(n + ½) for the First Thousand Values of n,” H. E. Salzer, 1951.

NBSAMS 17 “Tables of Coulomb Wave Functions,” vol. I, 1952.

NBSAMS 18 “Construction and Applications of Conformal Maps,” E. F. Beckenbach, ed., 1952.

NBSAMS 19 “Hypergeometric and Legendre Functions with Applications to Integral Equations of Potential Theory,” C. Snow, 1952.

NBSAMS 20 “Tables for Rocket and Comet Orbits,” S. Herrick, 1953.

NBSAMS 21 “A Guide to Tables of the Normal Probability Integral,” 1952.

NBSAMS 22 “Probability Tables for the Analysis of Extreme-value Data,” 1952.

NBSAMS 23 “Tables of Normal Probability Functions,” 1952.

NBSAMS 24 “Introduction to the Theory of Stochastic Processes Depending on a Continuous Parameter,” H. B. Mann, 1952.

NBSAMS 25 “Table of the Bessel Functions $Y_0(x)$, $Y_1(x)$, $K_0(x)$, $K_1(x)$, $0 \le x \le 1$,” 1952.

NBSAMS 26 “Table of Arctan x,” 1953.

NBSAMS 27 “Table of $10^x$,” 1953.

NBSAMS 28 “Table of Bessel-Clifford Functions of Order Zero and One,” 1952.

NBSAMS 29 “Linear Simultaneous Equations and the Determination of Eigenvalues,” L. J. Paige and O. Taussky, eds., 1952.

NBSAMS 30 “Tables of Coefficients for the Numerical Calculation of Laplace Transforms,” H. E. Salzer, 1953.

NBSAMS 31 “Table of Natural Logarithms for Arguments between Zero and Five to Sixteen Decimal Places,” 1952.

NBSAMS 32 “Table of Sine and Cosine Integrals for Arguments from 10 to 100,” 1953.

NBSAMS 33 “The Statistical Theory of Extreme Values and Some Practical Applications,” E. J. Gumbel, 1954.

NBSAMS 34 “Table of the Gamma Function for Complex Arguments,” 1954.

NBSAMS 35 “Tables of Lagrangian Coefficients for Sexagesimal Interpolation,” 1954.

NBSAMS 36 “Tables of Circular and Hyperbolic Sines and Cosines for Radian Arguments,” 1953.

NBSAMS 37 “Tables of Functions and of Zeros of Functions: Collected Short Tables of the Computation Laboratory,” 1954.

NBSAMS 38 “Magnetic Fields of Cylindrical and Annular Coils,” C. Snow, 1954.

NBSAMS 39 “Contributions to the Solution of Systems of Linear Equations and the Determination of Eigenvalues,” O. Taussky, ed., 1954.

NBSAMS 40 “Table of Secants and Cosecants to Nine Significant Figures at Hundredths of a Degree,” 1954.

NBSAMS 41 “Tables of the Error Function and Its Derivatives,” 1954.

NBSAMS 42 “Experiments in the Computation of Conformal Maps,” J. Todd, ed., 1955.

NBSAMS 43 “Tables of Sines and Cosines for Radian Arguments,” 1955.

NBSAMS 44 “Table of Salvo Kill Probabilities for Square Targets,” 1954.

NBSAMS 45 “Table of Hyperbolic Sines and Cosines, x = 2 to x = 10,” 1955.

NBSAMS 46 “Table of the Descending Exponential, x = 2.5 to x = 10,” 1955.

NBSAMS 47 “Contributions on Partially Balanced Incomplete Block Designs with Two Associate Classes,” W. H. Clatworthy, 1956.

NBSAMS 48 “Fractional Factorial Experiment Designs for Factors at Two Levels,” Statistical Engineering Laboratory, 1957.

NBSAMS 49 “Further Contributions to the Solution of Simultaneous Linear Equations and the Determination of Eigenvalues,” 1958.

NBSAMS 50 “Tables of the Bivariate Normal Distribution Function and Related Functions,” 1959.

NBSAMS 51 “Table of the Exponential Integral for Complex Arguments,” 1958.

NBSAMS 52 “Integrals of Airy Functions,” 1958.

NBSAMS 53 “Tables of Natural Logarithm for Arguments between Five and Ten to Sixteen Decimal Places,” 1958.

NBSAMS 54 “Fractional Factorial Experiment Designs for Factors at Three Levels,” W. S. Connor and M. Zelen, 1959.

NBSAMS 56 “Tables of Osculatory Interpolation Coefficients,” H. E. Salzer, 1959.

NBSAMS 57 “Basic Theorems in Matrix Theory,” M. Marcus, 1960.


Akademia Nauk SSSR 


A series of excellent tables is being produced in Russia. Among these
are:


“Tables of Logarithms of Complex Numbers,” 1952.

A. A. Abramov, “Tables of ln Γ(z) in the Complex Domain,” 1953.

“Tables of Fresnel Integrals,” 1953.

K. A. Karpov, “Tables of the Function $w(z) = e^{-z^2} \int_0^z e^{x^2}\,dx$ in the Complex
Domain,” 1954.

“Tables of Sine and Cosine Integrals,” 1954.

“Tables of the Exponential Integral,” 1954.

L. N. Karmazina, “Tables of the Jacobi Polynomials,” 1954.

“Tables of $e^x$ and $e^{-x}$,” 1955.

A. D. Smirnov, “Tables of Airy Functions,” 1955.

K. A. Karpov and S. N. Razumovskii, “Tables of the Logarithmic Integral,”
1956.

E. N. Dekanosidze, “Table of Cylinder Functions of Two Variables,”
1956.

L. N. Karmazina and L. V. Kurochkina, “Table of Interpolation Coefficients,”
1956.

A. I. Vzorova, “Tables for Solution of the Laplace Equation in Elliptic
Domains,” 1957.

K. A. Karpov, “Tables of the Function $F(z) = \int_0^z e^{x^2}\,dx$ in the Complex
Domain,” 1958.

E. A. Chistova, “Tables of Bessel Functions of Real Arguments and
Their Integrals,” 1958.

L. N. Karmazina and E. A. Chistova, “Tables of Bessel Functions of
Imaginary Arguments and Their Integrals,” 1958.

V. I. Pagurova, “Tables of the Integral-exponential Function
$E_\nu(x) = \int_1^\infty e^{-xu} u^{-\nu}\,du$,” 1959.

I. E. Kireeva and K. A. Karpov, “Tables of the Weber Functions,”
vol. 1, 1959.

M. I. Zurina and L. N. Karmazina, “Tables of the Legendre Functions
$P_{-1/2+i\tau}(x)$,” vol. 1, 1960.

L. N. Nosova, “Table of Kelvin Functions and Their First Derivatives,”
1960.

The following important table was published in Russian but is not in
the series of the Akademia Nauk:


V. N. Faddeeva and N. M. Terentiev, “Tables of the Values of the
Probability Integral of Complex Arguments,” 1954.


2.42 Individual Authors 


In addition to the tables in the series described in the preceding section, 
there are certain individuals who have produced or directed the produc- 
tion of tables of importance for the general worker or the specialist. 
Among these are J. Peters, L. J. Comrie, J. C. P. Miller, H. T. Davis, 
and K. Hayashi. 

The tables of general use in this category include the following: 


L. M. Milne-Thomson, “Jacobian Elliptic Functions,” New York, 1950.
E. Cambi, “Eleven and Fifteen Place Tables of Bessel Functions of the
First Kind, to All Significant Orders,” New York, 1948.


An elaborate collection of constants to many decimals, by J. Peters,
J. Stein, and G. Witt, is given in the appendix of the following work:


J. Peters, “Ten-place Logarithm Table,” vol. I: “Ten-place Logarithms
of the Numbers from 1 to 100,000 together with an Appendix of
Mathematical Tables,” New York, 1957.


2.43 Texts and Treatises on Numerical Analysis 


There follows a short list of books on numerical analysis which are 
currently available 


1. N. I. Achieser, “Vorlesungen iiber Approximationstheorie,”’ 
Akademie-Verlag G.m.b.H., Berlin, 1953. 

2. D. N. de G. Allen, “Relaxation Methods,” McGraw-Hill Book 
Company, Inc., New York, 1954. 

3. F. L. Alt, “Electronic Digital Computers,” Academic Press, Inc., 
New York, 1958. 

4. E. F. Beckenbach, ed., “Modern Mathematics for the Engineer,” 
McGraw-Hill Book Company, Inc., New York, First Series, 1956; 
Second Series, 1961. 

5. E. F. Beckenbach, ed., “Construction and Applications of Conformal 
Maps,” National Bureau of Standards Applied Mathematics Series, 
vol. 18, 1952. 

6. A. A. Bennett, W. E. Milne, and H. Bateman, “Numerical Integration 
of Differential Equations,” Dover Publications, New York, 1956. 

7. E. Bodewig, “Matrix Calculus,” North Holland Publishing Company, 
Amsterdam, 1956. 

8. A. D. Booth, “Numerical Methods,” Academic Press, Inc., New 
York, 1958. 

9. R. A. Buckingham, “Numerical Methods,” Sir Isaac Pitman & Sons, 
Ltd., London, 1957. 

10. H. F. Bueckner, “Die praktische Behandlung von Integralgleichungen,” 
Springer-Verlag, Berlin, 1952. 

11. L. Collatz, “Eigenwertaufgaben mit technischen Anwendungen,” 
Akademische Verlagsgesellschaft m.b.H., Leipzig, 1949. 

12. L. Collatz, “Handbuch der Physik,” II, pp. 369-470, “Numerische 
und graphische Methoden,” Springer-Verlag, Berlin, 1955. 

13. L. Collatz, “Eigenwertprobleme und ihre numerische Behandlung,” 
Akademische Verlagsgesellschaft m.b.H., Leipzig, 1945. 

14. L. Collatz, “The Numerical Treatment of Differential Equations,” 
Springer-Verlag, Berlin, 1960. 

15. S. H. Crandall, “Engineering Analysis,” McGraw-Hill Book Company, 
Inc., New York, 1956. 

16. American Mathematical Society, “Numerical Analysis: Proceedings 
of Symposia in Applied Mathematics, Volume VI,” J. H. Curtiss, ed., 
McGraw-Hill Book Company, Inc., New York, 1956. 

17. P. S. Dwyer, “Linear Computations,” John Wiley & Sons, Inc., 
New York, 1951. 

18. Engineering Research Associates, “High-speed Computing Devices,” 
W. W. Stifler, Jr., ed., McGraw-Hill Book Company, Inc., 
New York, 1950. 

19. V. N. Faddeeva, “Computational Methods in Linear Algebra,” 
C. D. Benster, tr., Dover Publications, New York, 1959. 

20. G. E. Forsythe, Contemporary State of Numerical Analysis, in 
“Surveys of Applied Mathematics,” vol. 5, John Wiley & Sons, Inc., 
New York, 1958. 

21. G. E. Forsythe and W. A. Wasow, “Finite-difference Methods for 
Partial Differential Equations,” John Wiley & Sons, Inc., New 
York, 1960. 

22. L. Fox, “Numerical Solution of Two-point Boundary Problems,” 
Clarendon Press, Oxford, 1957. 

23. R. A. Frazer, W. J. Duncan, and A. R. Collar, “Elementary 
Matrices,” Cambridge University Press, London, 1938. 

24. D. Gibb, “Interpolation and Numerical Integration,” George Bell & 
Sons, Ltd., London, 1915. 

25. V. I. Goncarov, “Interpolation and Approximation,” 1954 (in 
Russian). 

26. E. M. Grabbe, S. Ramo, D. E. Wooldridge, eds., “Handbook of 
Automation, Computation and Control,” vols. I-III, John Wiley & 
Sons, Inc., New York, 1958-1961. 
27. D. R. Hartree, “Numerical Analysis,” Clarendon Press, Oxford, 
1958. 

28. C. Hastings, Jr., “Approximations for Digital Computers,” Princeton 
University Press, Princeton, N.J., 1955. 

29. F. B. Hildebrand, “Introduction to Numerical Analysis,” McGraw-Hill 
Book Company, Inc., New York, 1956. 

30. A. S. Householder, “Principles of Numerical Analysis,” McGraw-Hill 
Book Company, Inc., New York, 1953. 

31. “Interpolation and Allied Tables,” H. M. Stationery Office, London, 
1956. 

32. H. Jeffreys and B. S. Jeffreys, “Methods of Mathematical Physics,” 
Cambridge University Press, London, 1956. 

33. F. John, Advanced Numerical Analysis, lecture notes, New York 
University, New York, 1956. 

34. C. Jordan, “Calculus of Finite Differences,” Chelsea Publishing 
Company, New York, 1947. 

35. L. V. Kantorovitch and V. I. Krylov, “Approximate Methods of 
Higher Analysis,” C. D. Benster, tr., Interscience Publishers, Inc., 
New York, 1958. 

36. Z. Kopal, “Numerical Analysis,” John Wiley & Sons, Inc., New 
York, 1955. 

37. J. Kuntzmann, “Méthodes numériques: Interpolation, dérivées,” 
Dunod, Paris, 1959. 

38. K. S. Kunz, “Numerical Analysis,” McGraw-Hill Book Company, 
Inc., New York, 1957. 

39. C. Lanczos, “Applied Analysis,” Prentice-Hall, Inc., Englewood 
Cliffs, N.J., 1956. 

40. R. E. Langer, ed., “On Numerical Approximation,” University of 
Wisconsin Press, Madison, Wis., 1959. 

41. H. Levy and E. A. Baggott, “Numerical Studies in Differential 
Equations,” Dover Publications, New York, 1950. 

42. M. Marcus, “Basic Theorems in Matrix Theory,” National Bureau 
of Standards Applied Mathematics Series, vol. 57, 1960. 

43. H. A. Meyer, Symposium on Monte Carlo Methods, John Wiley & 
Sons, Inc., New York, 1956. 

44. S. E. Mikeladze, “Numerical Methods in Mathematical Analysis,” 
1953 (in Russian). 

45. W. E. Milne, “Numerical Calculus,” Princeton University Press, 
Princeton, N.J., 1950. 

46. W. E. Milne, “Numerical Solution of Differential Equations,” John 
Wiley & Sons, Inc., New York, 1953. 
47. L. M. Milne-Thomson, “Calculus of Finite Differences,” Macmillan 
& Co., Ltd., London, 1933. 

48. H. Mineur, “Techniques de calcul numérique,” Béranger, Paris, 1952. 

49. “Modern Computing Methods,” H. M. Stationery Office, London, 
1961. 

50. I. P. Natanson, “Konstruktive Funktionentheorie,” Akademie-Verlag 
G.m.b.H., Berlin, 1955. 

51. K. J. Nielsen, “Methods in Numerical Analysis,” The Macmillan 
Company, New York, 1956. 

52. N. E. Nörlund, “Differenzenrechnung,” Springer-Verlag, Berlin, 
1924. 

53. A. M. Ostrowski, “Vorlesungen über Differential- und Integralrechnung,” 
vol. 2, Birkhäuser, Basel, 1951. 

54. A. M. Ostrowski, “Theory of the Solution of Equations and Systems 
of Equations,” Academic Press, Inc., New York, 1960. 

55. L. J. Paige and O. Taussky, eds., “Simultaneous Linear Equations 
and the Determination of Eigenvalues,” National Bureau of Standards 
Applied Mathematics Series, vol. 29, 1953. 

56. A. Ralston and H. S. Wilf, eds., “Mathematical Methods for Digital 
Computers,” John Wiley & Sons, Inc., New York, 1960. 

57. R. D. Richtmyer, “Difference Methods for Initial Value Problems,” 
Interscience Publishers, Inc., New York, 1958. 

58. C. Runge and H. König, “Vorlesungen über numerisches Rechnen,” 
Springer-Verlag, Berlin, 1924. 

59. J. B. Scarborough, “Numerical Mathematical Analysis,” Johns 
Hopkins Press, Baltimore, 1958. 

60. F. S. Shaw, “Introduction to Relaxation Methods,” Dover Publications, 
New York, 1953. 

61. R. V. Southwell, “Relaxation Methods in Engineering Science,” 
Clarendon Press, Oxford, 1940. 

62. R. V. Southwell, “Relaxation Methods in Theoretical Physics,” 
Clarendon Press, Oxford, 1946. 

63. J. F. Steffensen, “Interpolation,” Chelsea Publishing Company, 
New York, 1927. 

64. E. L. Stiefel, P. Henrici, and H. Rutishauser, “Further Contributions 
to the Solution of Simultaneous Linear Equations and the 
Determination of Eigenvalues,” National Bureau of Standards Applied 
Mathematics Series, vol. 49, 1958. 

65. “Subtabulation,” H. M. Stationery Office, London, 1958. 

66. O. Taussky, ed., “Contributions to the Solution of Systems of Linear 
Equations and the Determination of Eigenvalues,” National Bureau 
of Standards Applied Mathematics Series, vol. 39, 1954. 

67. J. Todd, ed., “Experiments in the Computation of Conformal 
Maps,” National Bureau of Standards Applied Mathematics Series, 
vol. 42, 1955. 

68. R. S. Varga, “Iterative Numerical Analysis,” Prentice-Hall, Inc., 
Englewood Cliffs, N.J., to appear. 

69. E. T. Whittaker and G. Robinson, “Calculus of Observations,” 
Blackie & Son, Ltd., Glasgow, 1946. 

70. F. A. Willers, “Methoden der praktischen Analysis,” Walter De 
Gruyter & Co., Berlin, 1950. 

71. F. A. Willers, “Practical Analysis,” R. T. Beyer, tr., Dover Publications, 
New York, 1948. 

72. R. Zurmühl, “Matrizen,” Springer-Verlag, Berlin, 1950. 

73. R. Zurmühl, “Praktische Mathematik,” Springer-Verlag, Berlin, 
1957. 

74. A. O. Gelfond, “Differenzenrechnung,” D. Verlag Wiss., Berlin, 
1958 (translation of Russian edition, 1952). 

75. A. Korganoff, with the collaboration of L. Bossett, J. L. Groboillot 
and J. Johnson, “Méthodes de calcul numérique,” tome 1: Algèbre 
non-linéaire, Dunod, Paris, 1961. 

76. D. K. Faddeev and V. N. Faddeeva, “Computational Methods of 
Linear Algebra,” Moscow, 1960 (in Russian). (This is an extended 
version of 19.) 

77. G. N. Lance, “Numerical Methods for High Speed Computers,” 
Iliffe, London, 1960. 

78. F. L. Alt, ed., “Advances in Computers,” Academic Press, New 
York. This is intended to be a continuing publication, the first 
volume of which appeared in 1960. 

79. I. S. Berezin and N. P. Zidkov, “Computational Methods,” 2 vols., 
Moscow, 1959 (in Russian). 


PROBLEMS 
2.1. If $a_0 = 1$, $b_0 = .2$ and if, for $n \ge 0$, 

$$a_{n+1} = \tfrac12 (a_n + b_n), \qquad b_{n+1} = \sqrt{a_n b_n},$$

calculate $a_1, a_2, \ldots$ and $b_1, b_2, \ldots$, working to 10D. 
Repeat the calculation working to double precision. 
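
A minimal Python sketch of the iteration in Prob. 2.1 may help in organizing the hand calculation. It assumes that the recurrence is the arithmetic-geometric mean pair $a_{n+1} = \tfrac12(a_n + b_n)$, $b_{n+1} = \sqrt{a_n b_n}$ with $a_0 = 1$, $b_0 = .2$; the function name `agm_steps` and the number of steps are illustrative choices, and ordinary floating-point arithmetic stands in for 10D working.

```python
import math

def agm_steps(a, b, steps):
    """Return [(a_0, b_0), (a_1, b_1), ...] for the assumed recurrence
    a_{n+1} = (a_n + b_n) / 2,  b_{n+1} = sqrt(a_n * b_n)."""
    rows = [(a, b)]
    for _ in range(steps):
        # The tuple assignment uses the old a_n, b_n on the right-hand side.
        a, b = 0.5 * (a + b), math.sqrt(a * b)
        rows.append((a, b))
    return rows

rows = agm_steps(1.0, 0.2, 6)
for n, (a, b) in enumerate(rows):
    # Displayed to 10 decimals for comparison with 10D work.
    print(f"n={n}  a_n={a:.10f}  b_n={b:.10f}")
```

The two columns should be seen to approach a common limit very quickly, which is the behavior the problem asks the reader to observe.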
2.2. If $x_0 = .5$, $y_0 = 1$ and if, for $n \ge 0$, 

$$x_{n+1} = \tfrac12 (x_n + y_n), \qquad y_{n+1} = \sqrt{x_{n+1} y_n},$$

calculate $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, working to 10D. 
Repeat the calculation working to double precision. 
2.3. Check the results of Probs. 2.1 and 2.2 from tables, using the values of the 
limits given in the text. 
2.4. Using the Newton process on $f(x) = x^{-p} - N$, obtain a recurrence 
relation which converges to $N^{-1/p}$. 
For what range of initial values does it converge if, for example, $p = 3$? 
Use this to determine $(1.2345678)^{-1/3}$ to at least 7D. 
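
The following Python sketch illustrates one reading of Prob. 2.4. It assumes the Newton process is applied to $f(x) = x^{-p} - N$, which gives the division-free iteration $x_{k+1} = x_k[(p + 1) - N x_k^p]/p$; the function name `inverse_root`, the starting value 1 and the step count are illustrative assumptions, not part of the problem.

```python
def inverse_root(N, p, x0, steps=8):
    """Iterate x <- x * ((p + 1) - N * x**p) / p, i.e. Newton's process
    applied to f(x) = x**(-p) - N (an assumed reading of the problem)."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x * ((p + 1) - N * x**p) / p
        history.append(x)
    return history

history = inverse_root(1.2345678, 3, x0=1.0)
for k, x in enumerate(history):
    # N * x^3 should approach 1 as x approaches N**(-1/3).
    print(f"k={k}  x_k={x:.10f}  N*x_k^3={1.2345678 * x**3:.10f}")
```

Trying other starting values with the same function is a quick way to explore the range of convergence asked for in the problem.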
2.5. Evaluate in the form arctan (m/n), where m, n are integers, 


2 arctan | — arctan 2 + arctan 4 + arctan 5 


+ arctan 34 — arctan 208 — arctan 479. 


(For further examples of this kind see Todd, NBSAMS 11.) 
2.6. (Etherington) Evaluate the determinants 


—73 78 24 —73 78 24 
92 66 25), 92 66 25). 
—80 37 10 —80 37 10.0) 


2.7. Evaluate $\theta_n(x)$ for $x = .1$ and $x = 1$ and for $n = 2(1)8$ where 

$$\theta_0(.1) = .99750\ 156207, \qquad \theta_0(1) = .76519\ 768656,$$
$$\theta_1(.1) = .04993\ 752604, \qquad \theta_1(1) = .44005\ 058575$$

and where, for $n \ge 1$, 

$$x\,\theta_{n+1}(x) = 2n\,\theta_n(x) - x\,\theta_{n-1}(x).$$

Evaluate also 

$$r_n(x) = \theta_n(x) - J_n(x)$$

where $J_n(x)$ is the Bessel function of order $n$. 
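
As a hedged illustration of Prob. 2.7, the sketch below runs the three-term recurrence forward from the tabulated starting values. It assumes the recurrence reads $x\,t_{n+1} = 2n\,t_n - x\,t_{n-1}$; the names `forward_recurrence`, `starting_values` and `tables` are hypothetical. Because the starting values are rounded, the forward recurrence can be expected to lose accuracy as $n$ grows, most noticeably for the small argument.

```python
def forward_recurrence(x, t0, t1, n_max):
    """Generate t_0, ..., t_{n_max} from x*t_{n+1} = 2n*t_n - x*t_{n-1}."""
    t = [t0, t1]
    for n in range(1, n_max):
        t.append((2 * n * t[n] - x * t[n - 1]) / x)
    return t

# Starting values copied from the statement of the problem.
starting_values = {
    0.1: (0.99750156207, 0.04993752604),
    1.0: (0.76519768656, 0.44005058575),
}
tables = {x: forward_recurrence(x, t0, t1, 8)
          for x, (t0, t1) in starting_values.items()}

for x, t in tables.items():
    for n, value in enumerate(t):
        print(f"x={x}  n={n}  value={value: .11e}")
```

Comparing the printed values with tabulated $J_n(x)$ gives the differences $r_n(x)$ requested in the problem.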
2.8. Evaluate $J_0(5)$, $J_0(10)$, $J_0(20)$ using the asymptotic formula 

$$J_0(x) \sim \sqrt{\frac{2}{\pi x}} \left[ P_0(x) \cos (x - \tfrac14 \pi) - Q_0(x) \sin (x - \tfrac14 \pi) \right]$$

where 

$$P_0(x) \sim 1 - \frac{1^2 \cdot 3^2}{2!\,(8x)^2} + \frac{1^2 \cdot 3^2 \cdot 5^2 \cdot 7^2}{4!\,(8x)^4} - \cdots,$$

$$Q_0(x) \sim -\frac{1^2}{1!\,(8x)} + \frac{1^2 \cdot 3^2 \cdot 5^2}{3!\,(8x)^3} - \cdots.$$

Compare the results you obtain with the values given in standard 
tables. 
2.9. The following is part of a table of $\sin \theta$, at a constant interval, with its 
differences. Complete the table and check the final value, at $\theta_9$, from a standard 
table. 

    θ_0   .43313 47858 66963 
                                873933 05476 
    θ_1   .43322 21791 72439                   -4073056 
                                873892 32420               -822 
    θ_2                                                             0 
    θ_3                                                             2 
    θ_4                                                            -3 
    θ_5                                                            +2 
    θ_6                                                             0 
    θ_7                                                            -1 
    θ_8 
    θ_9 
2.10. Check the following table by differencing: 
x f(x) 
213 310 87274 
214 527 79587 
215 745 55816 
216 964 16123 
217 1283 60670 
218 1403 89609 
219 1625 30132 
220 1847 01371 
221 2069 84498 
222 2293 52675 


2.11. Check the following table by differencing: 

     x        f(x) 
    -3       4593 46961 
    -2      19046 00625 
    -1      33551 77121 
     0      48211 85121 
     1      63027 33681 
     2      77999 32214 
     3      93128 90625 
     4    1 08417 19041 
     5    1 23865 28081 
     6    1 39474 28721 
     7    1 55245 32321 

Correct it and obtain $f(.5)$ by the following methods: 

a. Linear interpolation 

b. Bessel 

c. Everett 

d. Lagrange 4-point 

e. Bessel or Everett with fourth-difference contribution 

f. Bessel or Everett with modified differences. 

2.12. Use the Aitken process to find a zero of Bi(—x) given the values: 


Bi(—.9) = .16263895 
Bi(—1.0) = .10399739 
Bi(—1.1) = .04432659 


Bi(—1.2) = —.01582137 
Bi(—1.3) = —.07576964 
Bi(—1.4) = —.13472406. 


Check your result from tables. 
2.13. Illustrate Aitken’s method of linear interpolation by determining 
the zero of $J_0(x)$ between 5.515 and 5.525, being given the following values: 

    x               5.505     5.515     5.525     5.535 
    10^7 J_0(x)    -51374    -17287    +16740    +50704 

Check your results from tables. 
2.14. Use the following extract from a table of Bi(x) and its reduced 
derivatives to calculate (and check) the values of Bi(2.45), Bi′(2.45). 

     x      Bi(x)      τ       τ²     τ³    τ⁴   τ⁵ 
    2.4    5.61577    79418   6739    411   20    1 
    2.5    6.48166    94214   8102    501   25    1 

2.15. Illustrate various methods of interpolation (linear, Lagrangian, 
Bessel, Everett) by finding $f(2.5)$ where $f$ is given by the following table: 

    x       0     1      2      3      4      5 
    f(x)   789   1356   2268   3648   5819   9304 


2.16. Tabulate $\ln \Gamma(x) - (x - \tfrac12) \ln x + x$, for $x^{-1} = .1(-.01)0$. To 
how many places is linear interpolation good? 

Evaluate $\ln \Gamma(x)$ and $\Gamma(x)$ for $x = 40$ using this table. 

2.17. Construct an example to show that the error in interpolation in an 
“outer” interval in a 4-point Lagrangian interpolation can be significantly 
larger than that in the “center” interval. 
2.18. Evaluate $\int_0^{12} f(x)\,dx$ by various methods where $f(x)$ is given by 

     x    10^5 f(x)       x    10^5 f(x) 
     0      37500         7     -41206 
     1      33794         8     -23300 
     2      23200         9     +20794 
     3       7294        10     100000 
     4     -11300        11     224294 
     5     -28906        12     404700 
     6     -40800 

What is the exact result? [Use, for example, Simpson, (Milne), $(3/8)^4$, 
$(\text{Weddle})^2$, Gregory, central differences (Gauss).] (See BAAS 11.) 
t 


2.19. Evaluate} (1 — x2) dx by various methods—in particular, by 


0 
(a) finding the indefinite integral, (b) Simpson’s rule, (c) Gregory’s formula, 
and (d) Gauss’ central-difference method. 
2.20. Discuss the evaluation of 

$$J = \int_{-\infty}^{+\infty} \frac{e^{-x^2}}{1 + x^2}\,dx$$

using the Hermite quadrature. (See J. Barkley Rosser, Note on Zeros of the 
Hermite Polynomials and Weights for Gauss’ Mechanical Quadrature 
Formula, Proc. Amer. Math. Soc., vol. 1, pp. 388-389, 1950.) 

2.21. Discuss the approximation to the integral $J$ in Prob. 2.20 by the 
infinite series 

$$h \sum_{n=-\infty}^{\infty} \frac{e^{-n^2 h^2}}{1 + n^2 h^2}$$

for varying values of $h$. (See E. T. Goodwin, The Evaluation of Integrals of 
the Form $\int_{-\infty}^{\infty} f(x) e^{-x^2}\,dx$, Proc. Cambridge Philos. Soc., vol. 45, pp. 241-245, 1949.) 


2.22. Verify numerically that Ai(x) satisfies the differential equation $y'' = xy$ 
at $x = 1.4$, being given 

     x      Ai(x) 
    1.1    .12004 943 
    1.2    .10612 576 
    1.3    .09347 467 
    1.4    .08203 805 
    1.5    .07174 950 
    1.6    .06253 691 
    1.7    .05432 479 
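
One way to carry out the check in Prob. 2.22 is sketched below in Python. The tabulated values are copied from the problem; the use of central second differences at three spacings, and of the fourth-difference correction $h^2 y'' \approx \delta^2 - \delta^4/12$, is an illustrative choice rather than a prescribed method.

```python
# Values of Ai(x) at x = 1.1(0.1)1.7, copied from the table in Prob. 2.22.
xs = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
ai = [0.12004943, 0.10612576, 0.09347467, 0.08203805,
      0.07174950, 0.06253691, 0.05432479]

c = 3                      # index of x = 1.4
rhs = xs[c] * ai[c]        # x * y, the right-hand side of y'' = x y

def second_difference(k):
    """Central second difference of Ai at x = 1.4 using the spacing k * 0.1."""
    return ai[c + k] - 2 * ai[c] + ai[c - k]

# Estimates of y''(1.4) with h = 0.1, 0.2, 0.3.
estimates = {k: second_difference(k) / (0.1 * k) ** 2 for k in (1, 2, 3)}

# Adding the fourth-difference term: h^2 y'' ~ delta^2 - delta^4 / 12.
d4 = ai[c - 2] - 4 * ai[c - 1] + 6 * ai[c] - 4 * ai[c + 1] + ai[c + 2]
corrected = (second_difference(1) - d4 / 12) / 0.1 ** 2

print("x*y       =", rhs)
for k, value in estimates.items():
    print(f"h = {0.1 * k:.1f}: y'' ~ {value:.8f}")
print("corrected =", corrected)
```

The discrepancy between the simple estimates and $xy$ grows roughly like $h^2$, while the corrected estimate agrees to about the accuracy the rounded table allows.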


2.23. Verify numerically that $J_0(x)$ satisfies the differential equation 

$$x y'' + y' + x y = 0,$$

for, say, $x = 5$, $x = 20$, and investigate the effect of varying the interval $h$. 

[Obtain from tables the values for $x = \ldots, 4.999, 5.000, 5.001, \ldots$; 
$x = \ldots, 4.95, 5.00, 5.05, \ldots$; $x = \ldots, 4.9, 5.0, 5.1, \ldots$; $\ldots$. Similarly for 
$x = \ldots, 19.99, 20.00, \ldots$; $\ldots$; $x = \ldots, 19.9, 20.0, 20.1, \ldots$. 
Use values of $J_0(x)$ to appropriate precision.] 


2.24. Apply the $\delta^2$ method twice to the following sequence: 2.01761, 3.76481, 
2.70015, 3.36391, 2.94210, 3.21407, 3.03687, 3.15317, 3.07645. (See Todd 
and Warschawski [64], p. 41.) 
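
A short Python sketch of the calculation in Prob. 2.24 follows. It takes the $\delta^2$ method to be Aitken's process, $s_n - (\Delta s_n)^2/\Delta^2 s_n$, applied to each run of three consecutive terms; the function name `aitken` is an illustrative choice.

```python
def aitken(seq):
    """One pass of the delta-squared process: s_n - (Δs_n)^2 / Δ²s_n."""
    out = []
    for s0, s1, s2 in zip(seq, seq[1:], seq[2:]):
        d1 = s1 - s0                 # first difference
        d2 = s2 - 2 * s1 + s0        # second difference
        out.append(s0 - d1 * d1 / d2)
    return out

s = [2.01761, 3.76481, 2.70015, 3.36391, 2.94210,
     3.21407, 3.03687, 3.15317, 3.07645]
once = aitken(s)       # 7 values
twice = aitken(once)   # 5 values

for label, values in (("once", once), ("twice", twice)):
    print(label, [round(v, 5) for v in values])
```

Each pass shortens the sequence by two terms, and the spread of the values shrinks markedly from one pass to the next.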
2.25. Apply a backward recurrence scheme to obtain values of the spherical 
Bessel function 

$$j_n(x) = \sqrt{\pi/2x}\; J_{n+1/2}(x)$$

for $x = 10$, $n = 20$, using the relation 

$$j_0(x) = \sin x / x$$

to obtain the scale factor. Check your result from tables. 
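
The backward recurrence of Prob. 2.25 can be sketched in Python as follows. The sketch uses the standard relation $j_{n-1}(x) = \frac{2n+1}{x} j_n(x) - j_{n+1}(x)$ and starts from arbitrary trial values well above the order wanted; the starting order 40 and the name `spherical_bessel_backward` are assumptions made for illustration.

```python
import math

def spherical_bessel_backward(x, n_max, n_start):
    """Return [j_0(x), ..., j_{n_max}(x)] by backward recurrence from n_start > n_max.

    Uses j_{n-1}(x) = ((2n + 1) / x) * j_n(x) - j_{n+1}(x), starting from the trial
    values t_{n_start+1} = 0, t_{n_start} = 1, then rescales with j_0(x) = sin x / x.
    """
    t = [0.0] * (n_start + 2)
    t[n_start] = 1.0
    for n in range(n_start, 0, -1):
        t[n - 1] = (2 * n + 1) / x * t[n] - t[n + 1]
    scale = (math.sin(x) / x) / t[0]   # scale factor from the known j_0(x)
    return [scale * value for value in t[: n_max + 1]]

x = 10.0
j = spherical_bessel_backward(x, n_max=20, n_start=40)
for n in (0, 1, 2, 10, 20):
    print(f"j_{n}({x}) ~ {j[n]: .10e}")
```

The low orders can be compared with the closed forms $j_1(x) = \sin x/x^2 - \cos x/x$ and $j_2(x) = (3/x^3 - 1/x)\sin x - 3\cos x/x^2$ as a spot check before consulting tables for $n = 20$.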
2.26. (Fox-Sadler) Prepare a table to 6D or more of 

$$f(x) = \int_0^{\infty} (x + t)^{-1/2} \sin t\,dt$$

describing checking procedures and suggesting interpolation methods. 
Hints: 
a. Express $f(x)$ in terms of the Fresnel integrals 

$$S(y) = \int_0^y \sin \tfrac12 \pi t^2\,dt, \qquad C(y) = \int_0^y \cos \tfrac12 \pi t^2\,dt.$$

Obtain tables of these integrals and use them for spot checking. 
b. Obtain a power series representation and determine over what range 
it is convenient for tabulation: 

$$f(x) = \sqrt{\pi/2}\,(\cos x - \sin x) + \sum_{r=0}^{\infty} \frac{(-1)^r (2r + 1)!\; 2^{4r+3}\, x^{2r+3/2}}{(4r + 3)!}.$$

c. Obtain an asymptotic expansion in the form 

$$f(x) \sim x^{-1/2} \left( 1 - \frac{1 \cdot 3}{2 \cdot 2\, x^2} + \frac{1 \cdot 3 \cdot 5 \cdot 7}{2 \cdot 2 \cdot 2 \cdot 2\, x^4} - \cdots \right)$$

and determine over what range it is convenient for tabulation. 
d. Obtain the differential equation 

$$f'' + f = x^{-1/2}$$

and determine over what range this is convenient for tabulation. 


2.27. Explain in detail the derivation of the number 3.3 on the ninth line 
and 1.05 on the eleventh line of the A?/,” column of IAT, p. 62. 
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3.1 Introduction 


In this chapter we present, in outline, an introduction to the construc- 
tive theory of functions. This theory can be developed on theoretical 
lines, independent of numerical applications, but it can also be integrated 
into courses in numerical analysis. 

The term constructive theory of functions is due to the Russian mathema- 
tician S. N. Bernstein (1880— ). The subject derives from the work 
of Chebyshev (1821-1894) and his pupils—for example, Korkine, Zolo- 
tareff, and the brothers Markoff. It was set up as an independent dis- 
cipline by Bernstein, to whom and to whose pupils much of its later 
development is due. 

Basically our subject is concerned with the approximate representation 
of functions in terms of simpler ones. Two examples are: 

1. Functions $f(x)$ continuous on an interval $[a,b]$ in terms of polynomials 

$$\sum_{i=0}^{k} a_i x^i.$$

2. Continuous periodic functions $F(x)$ with $F(x) = F(x + 2\pi)$ in terms of 
trigonometrical polynomials 

$$\sum_{i=0}^{k} (a_i \cos ix + b_i \sin ix).$$


This subject, therefore, is likely to be of considerable use in the various 
applications, but it also turns out to be an intrinsically interesting area of 
mathematics that includes many sharp and appealing results which are 
easily accessible. 

A vital question is how the goodness of our approximations is to be 
measured, and there are widely different theories, according to which 
norm we use. The following two expansions can be developed formally, 
for the interval $-1 \le x \le 1$: 

$$|x| = \frac12 + \sum_{n=1}^{\infty} (-1)^{n+1} \frac{(2n - 2)!\,(4n + 1)}{2^{2n}\,(n - 1)!\,(n + 1)!}\, P_{2n}(x) \tag{3.1}$$

$$|x| = \frac{2}{\pi} + \frac{4}{\pi} \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{4n^2 - 1}\, T_{2n}(x) \tag{3.2}$$

where $P_n(x)$, $T_n(x)$ are the usual Legendre and Chebyshev polynomials. 
If we truncate these, we obtain the following series of approximations: 

$$\tfrac12, \quad 3(5x^2 + 1)/16, \quad 15(-7x^4 + 14x^2 + 1)/128, \quad \ldots \tag{3.3}$$

and 

$$2/\pi, \quad 2(4x^2 + 1)/3\pi, \quad 2(-16x^4 + 36x^2 + 3)/15\pi, \quad \ldots. \tag{3.4}$$

Another sequence of approximations (see Remez [32]) is 

$$\tfrac12, \quad x^2 + \tfrac18, \quad -1.065537x^4 + 1.930297x^2 + .067621, \quad \ldots. \tag{3.5}$$

It is instructive to compare the errors, for instance in the constant and 
quadratic cases, between $|x|$ and the expressions in Eqs. (3.3) to (3.5), 
using the following norms: 

$$M_\infty:\ \max_{-1 \le x \le 1} |e(x)|, \qquad M_1:\ \int_{-1}^{1} |e(x)|\,dx, \qquad M_2:\ \left[ \int_{-1}^{1} e(x)^2\,dx \right]^{1/2}.$$

In this chapter we are concerned mostly with the Chebyshev, or 
maximum, norm $M_\infty$, although we have something to say about the 
rms norm $M_2$; the norm $M_1$ is difficult to handle. However, it is 
obviously reasonable to try to build up an abstract theory of approximation 
in normed spaces (see, e.g., Buck [3]). 
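
The comparison of errors suggested above can be sketched numerically. The Python fragment below estimates the maximum norm, the integral of $|e(x)|$, and the square root of the integral of $e(x)^2$ over $[-1,1]$ for the constant and quadratic approximations to $|x|$ quoted in Eqs. (3.3) to (3.5); the midpoint sampling, the grid size, and the names `norms`, `approximations` and `results` are illustrative assumptions.

```python
import math

def norms(err, n=20000):
    """Approximate (max norm, L1 norm, L2 norm) of err on [-1, 1] by midpoint sampling."""
    h = 2.0 / n
    vals = [abs(err(-1.0 + (i + 0.5) * h)) for i in range(n)]
    return max(vals), h * sum(vals), math.sqrt(h * sum(v * v for v in vals))

# Constant and quadratic members of the three sequences of approximations to |x|.
approximations = {
    "Legendre constant": lambda x: 0.5,
    "Chebyshev constant": lambda x: 2 / math.pi,
    "Remez constant": lambda x: 0.5,
    "Legendre quadratic": lambda x: 3 * (5 * x * x + 1) / 16,
    "Chebyshev quadratic": lambda x: 2 * (4 * x * x + 1) / (3 * math.pi),
    "Remez quadratic": lambda x: x * x + 0.125,
}

results = {name: norms(lambda x, p=p: abs(x) - p(x))
           for name, p in approximations.items()}

for name, (m_inf, m_1, m_2) in results.items():
    print(f"{name:20s} max={m_inf:.5f}  L1={m_1:.5f}  L2={m_2:.5f}")
```

In the quadratic case the Remez polynomial should show the smallest maximum error and the truncated Legendre expansion the smallest mean-square error, which is the point of the comparison.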

The following notations are useful in this chapter. By $p_n(x)$ we understand 
a polynomial of degree $n$; if the leading coefficient $a_n$ of $p_n(x)$ is not 
zero, we define 

$$\tilde p_n(x) = (a_n)^{-1} p_n(x) = x^n + \cdots,$$

so that $\tilde p_n(x)$ has its leading coefficient unity. 


3.2 The Theorems of Weierstrass 


Theorem 3.1. If $f(x)$ is continuous in $[a,b]$, then, given any $\epsilon > 0$, 
there is a polynomial $p = p_n(x)$ such that 

$$|f(x) - p(x)| < \epsilon, \qquad a \le x \le b.$$

This was established by Weierstrass in 1885. Another form of this is 

Theorem 3.2. If $f(x)$ is continuous in $[a,b]$, there is a series of 
polynomials $\sum q_n(x)$ which is uniformly convergent to $f(x)$ in $[a,b]$. 

There are analogous theorems for continuous functions which are 
periodic in an interval which we can take to be $[0,2\pi]$. These can be 
established directly, or indirectly, by showing that they are equivalent to 
the above results. The analogue of Theorem 3.1 is as follows. 

Theorem 3.3. If $F(\theta)$ is a continuous function which has period $2\pi$, 
then, given any $\epsilon > 0$, there is a trigonometrical polynomial $T = T_n$: 

$$T(\theta) = \tfrac12 a_0 + \sum_{r=1}^{n} (a_r \cos r\theta + b_r \sin r\theta)$$

such that 

$$|F(\theta) - T(\theta)| < \epsilon, \qquad \text{all } \theta.$$

There are many proofs of these theorems. We describe briefly two of 
the simpler ones, due, respectively, to Lebesgue (1898) and to Landau 
(1908), but shall concentrate on that of Bernstein (1912). 

The proof that (3.1) $\Leftrightarrow$ (3.2) is easy. It is also easy to show that 
(3.3) $\Rightarrow$ (3.1). 


Lebesgue’s Proof of Theorem 3.1 


We note that, in virtue of uniform continuity, we can approximate 
$f(x)$ arbitrarily closely by a polygonal function, $g(x)$. We can represent 
$g(x)$ as a linear combination of polynomials and distorted functions $|x|$. 
The distortion consists in a change of origin and of scale: $a\,|x - b|$. The 
problem is therefore reduced to that of approximating $|x|$ by polynomials. 
This is done by noticing that, when $|x| \le 1$, 

$$|x| = \sqrt{1 - (1 - x^2)}$$

and expanding the right-hand side as a binomial series 

$$1 - \tfrac12 (1 - x^2) - \tfrac18 (1 - x^2)^2 - \cdots.$$

There is some difficulty about the behavior of this at $x = 0$. This can be 
overcome by a careful analysis of the convergence of the series at 
$x = 0$. This can be done by examining the ratio of consecutive terms 
of the series or, preferably, by a direct discussion of the remainder in 
the Maclaurin series 

$$(1 - u)^{1/2} = 1 - \frac{1}{2}\,u - \frac{1}{2 \cdot 4}\,u^2 - \frac{1 \cdot 3}{2 \cdot 4 \cdot 6}\,u^3 - \cdots.$$

An alternative, more elementary (but more complicated) treatment 
has been given by Ostrowski [29]. 


Landau’s Proof of Theorem 3.1 


Landau’s proof depends on the use of a singular integral (delta function). 
Consider, in the case $a = 0$, $b = 1$, 

$$I_n(\delta) = \int_{\delta}^{1} (1 - x^2)^n\,dx, \qquad 0 \le \delta \le 1.$$

It is plausible, from consideration of the graphs of $y = (1 - x^2)^n$, that $I_n(\delta)$ is 
negligible with respect to $I_n(0)$; the area represented by $I_n(\delta)$ is 
concentrated near $\delta = 0$. It can indeed be shown that when $\delta > 0$ 

$$I_n(\delta)/I_n(0) \to 0 \quad \text{as } n \to \infty.$$

This suggests that 

$$\{2 I_n(0)\}^{-1} \int_0^1 f(t)\,[1 - (t - x)^2]^n\,dt \to f(x).$$

This can be proved. Further, the left-hand side is a polynomial in $x$. 
This is essentially the result required. 

This proof makes use of the integral calculus, but it is not difficult to 
avoid this (see Landau [30]). 


de la Vallée Poussin’s Proof of Theorem 3.3 


The de la Vallée Poussin proof is similar to that of Landau. It is 
shown that, if 

$$V_n(F,\theta) = \frac{1}{2\pi}\,\frac{(2n)!!}{(2n - 1)!!} \int_{-\pi}^{\pi} F(\phi) \cos^{2n} \tfrac12 (\phi - \theta)\,d\phi,$$

then $V_n(F,\theta) \to F(\theta)$ uniformly. 


Bernstein's Proof of Theorem 3.1 


This proof is best motivated from elementary considerations of probability. 
The proof, however, does not depend on any probability ideas 
and, indeed, rather serves to justify these ideas. 

Consider repeated, independent experiments with two possible outcomes: 
H or T, T or F, for which the probabilities are $x$, $(1 - x)$, 
respectively. It is reasonable to suppose that in a large number $n$ of 
experiments there will be approximately $xn$ successes, $(1 - x)n$ failures. 
Now, by elementary reasoning, we know that the probability of exactly 
$k$ successes is $p_{nk} = \binom{n}{k} x^k (1 - x)^{n-k}$. This quantity, regarded as a 
function of $k$, with $n$, $x$ fixed, should have a high peak where $k = nx$ and 
should fall off rapidly on either side. This suggests that, although 
$\sum_{k=0}^{n} p_{nk} = 1$, the sum over the $k$ close to $nx$ will be close to 1, and the sum of 
remaining $p_{nk}$ will be near zero. 

With these thoughts in mind, we consider the Bernstein polynomials 
for $f(x)$: 

$$B_n(f,x) = \sum_{k=0}^{n} p_{nk}\, f\!\left(\frac{k}{n}\right)$$

and, $f$ being continuous, it follows that the right-hand side should be 
nearly $f(k/n) \approx f(x)$, and the approximation should improve as $n$ 
increases. 

We shall now begin afresh and establish the following result rigorously. 

Theorem 3.4. If $f(x)$ is continuous in $[0,1]$, then 

$$B_n(f,x) = \sum p_{nk}\, f\!\left(\frac{k}{n}\right) \to f(x)$$

uniformly in $[0,1]$. 
We need several lemmas. The first two can be obtained directly, or 
alternatively, by differentiating the identity 

$$(p + q)^n = \sum \binom{n}{k} p^k q^{n-k}$$

with respect to $p$, treating $p$, $q$ as independent variables. 

Lemma 3.5: 

$$\sum p_{nk} (k - nx) = 0.$$

Lemma 3.6: 

$$\sum p_{nk} (k - nx)^2 = nx(1 - x).$$

We shall now establish a third lemma, essentially the Chebyshev inequality. 

Lemma 3.7. If $\delta > 0$, then $\sum' p_{nk} \le x(1 - x)/n\delta^2$ if $\sum'$ is the sum 
over those $k$ for which $|(k/n) - x| \ge \delta$. 

Proof. Consider the summation in Lemma 3.6. Break it up into 
$\sum'$ and a complementary $\sum''$. We have 

$$nx(1 - x) = \sum' p_{nk} (k - nx)^2 + \sum'' p_{nk} (k - nx)^2 \ge \sum' p_{nk} (k - nx)^2 \ge \sum' p_{nk}\, n^2 \delta^2.$$

The first inequality follows because each term in $\sum$ and therefore in $\sum''$ 
is essentially nonnegative. The second inequality follows because, by 
definition of $\sum'$, we have $|k - nx| \ge n\delta$. Hence 

$$\sum' p_{nk} \le \frac{x(1 - x)}{n\delta^2}.$$

Remembering that $0 \le x(1 - x) \le \tfrac14$, we conclude that 

$$\sum'' p_{nk} \ge 1 - (4n\delta^2)^{-1}, \qquad \sum' p_{nk} \le (4n\delta^2)^{-1}. \tag{3.6}$$

(This is the precise form of the statement in italics on page 123.) 
Since $\sum p_{nk} = 1$, in order to prove Theorem 3.4 it will be sufficient to 
show that 

$$e_n(x) = B_n(x) - f(x) = \sum p_{nk} \left[ f\!\left(\frac{k}{n}\right) - f(x) \right]$$

tends uniformly to zero. We do this by breaking up the sum into $\sum'$ and 
$\sum''$ and showing that each part, separately, tends uniformly to zero. 
The first does so because $\sum' p_{nk}$ is small and $f(k/n) - f(x)$ is bounded, the 
second because although $\sum'' p_{nk}$ is near unity, since $f$ is continuous, $f(k/n)$ is 
near $f(x)$. 

Take any $\epsilon > 0$. Then choose $\delta$ such that 

$$|f(x') - f(x'')| < \tfrac12 \epsilon \quad \text{if } |x' - x''| < \delta, \quad 0 \le x', x'' \le 1.$$

Let $M = \max |f(x)|$. Then, using (3.6), we have 

$$|e_n(x)| \le \sum' p_{nk} \left| f\!\left(\frac{k}{n}\right) - f(x) \right| + \sum'' p_{nk} \left| f\!\left(\frac{k}{n}\right) - f(x) \right| \le 2M \sum' p_{nk} + \tfrac12 \epsilon \sum'' p_{nk} \le 2M (4n\delta^2)^{-1} + \tfrac12 \epsilon.$$

To get the last inequality, we have also used the fact that $\sum'' p_{nk} \le \sum p_{nk} = 1$. 
If $n \ge M \epsilon^{-1} \delta^{-2}$, then the first term in the last inequality does not 
exceed $\tfrac12 \epsilon$, and we conclude that 

$$|e_n(x)| < \epsilon, \qquad \text{all } x, \quad n \ge M \epsilon^{-1} \delta^{-2}.$$

This completes the proof of Theorem 3.4. 
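
The ingredients of this proof are easy to check numerically. The Python sketch below evaluates the weights $p_{nk}$, verifies Lemmas 3.5 and 3.6 and the tail bound of Lemma 3.7 for one choice of $n$, $x$, $\delta$, and tabulates the maximum error of $B_n(f,x)$ on a grid; the test function $|t - \tfrac12|$, the parameter values and the names `p_nk`, `bernstein`, `errors` are illustrative assumptions.

```python
import math

def p_nk(n, k, x):
    """Binomial weight C(n, k) x^k (1 - x)^(n - k)."""
    return math.comb(n, k) * x**k * (1 - x) ** (n - k)

def bernstein(f, n, x):
    """Bernstein polynomial B_n(f, x) = sum_k p_nk f(k / n)."""
    return sum(p_nk(n, k, x) * f(k / n) for k in range(n + 1))

n, x, delta = 50, 0.3, 0.1
weights = [p_nk(n, k, x) for k in range(n + 1)]
lemma_35 = sum(w * (k - n * x) for k, w in enumerate(weights))
lemma_36 = sum(w * (k - n * x) ** 2 for k, w in enumerate(weights))
# Sum over those k with |k/n - x| >= delta, written as |k - nx| >= n*delta.
tail = sum(w for k, w in enumerate(weights) if abs(k - n * x) >= n * delta)
bound = x * (1 - x) / (n * delta**2)

f = lambda t: abs(t - 0.5)   # continuous but not smooth
errors = {m: max(abs(bernstein(f, m, i / 200) - f(i / 200)) for i in range(201))
          for m in (10, 40, 160)}

print("Lemma 3.5 sum:", lemma_35)
print("Lemma 3.6 sum:", lemma_36, " n x (1 - x) =", n * x * (1 - x))
print("tail:", tail, " bound:", bound)
print("max errors:", errors)
```

The maximum error decreases as $n$ grows, but only slowly, in keeping with the uniform but unhurried convergence the theorem guarantees.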

It is clear that (3.4) $\Rightarrow$ (3.1). It is also true that (3.1) $\Rightarrow$ (3.4); this 
can be established along the following lines. 

It is clear that $B_n(x)$ is a polynomial in $x$ of degree $n$ at most. We 
shall show that the coefficient of $x^K$ in $B_n$ is $\binom{n}{K} \Delta^K f(0)$, where the 
differences are taken at interval $n^{-1}$. 

We observe that $x^K$ occurs only in the terms with $k = 0, 1, \ldots, K$ 
and that the contribution from the $(k + 1)$st term is 

$$f\!\left(\frac{k}{n}\right) \binom{n}{k} \binom{n-k}{K-k} (-1)^{K-k} = f\!\left(\frac{k}{n}\right) \frac{n!}{k!\,(n - k)!}\, \frac{(n - k)!}{(K - k)!\,(n - K)!}\, (-1)^{K-k} = \binom{n}{K} \binom{K}{k} (-1)^{K-k} f\!\left(\frac{k}{n}\right),$$

and the result follows, since 

$$\Delta^K f(0) = f\!\left(\frac{K}{n}\right) - \binom{K}{1} f\!\left(\frac{K-1}{n}\right) + \binom{K}{2} f\!\left(\frac{K-2}{n}\right) - \cdots + (-1)^K f(0).$$

Hence, if $f(x)$ is a polynomial of degree $r$, so is $B_n$, for $n \ge r$. Further, 
we can write 

$$\binom{n}{k} \Delta^k f(0) = \frac{1}{k!} \left(1 - \frac{1}{n}\right) \left(1 - \frac{2}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) \frac{\Delta^k f(0)}{(\Delta x)^k},$$

and so, letting $n \to \infty$, keeping $k$ fixed, and assuming that $f^{(k)}(0)$ exists, 
we see that 

$$\binom{n}{k} \Delta^k f(0) \to \frac{f^{(k)}(0)}{k!},$$

the coefficient of $x^k$ in the Maclaurin series of $f(x)$. This idea can 
be developed into a proof of Bernstein’s theorem. Using the Weierstrass 
theorem, we get a polynomial approximating $f(x)$ uniformly, and the 
above argument shows that the Bernstein polynomials of a polynomial 
approximate it uniformly, thus establishing the implication (3.1) $\Rightarrow$ (3.4). 

There are many generalizations of the Weierstrass theorems. Axio- 
matic accounts have been given by M. H. Stone and N. Bourbaki. 

It is easy to compute the $B_n(f)$ for $f = x^r$, $r = 0, 1, 2, \ldots$. For 
instance, we have 

$$B_n(1) = 1$$
$$B_n(x) = x$$
$$B_n(x^2) = x^2 + x(1 - x)/n$$
$$B_n(x^3) = x^3 + 3x^2(1 - x)/n + x(1 - x)(1 - 2x)/n^2.$$

It is evident that the $B_n(f)$ do not give the best approximation to $f$, 
among all polynomials of degree $\le n$; for example, if $f(x) = x^2$, $n = 2$, 
the best approximation in any reasonable sense is $x^2 \ne B_2(x^2)$. The 
order of approximation in this special case is typical. Voronowskaja 
[42] has established the following result. 

Theorem 3.8. Let $f(x)$ be continuous in $[0,1]$ and let $f''(x_0)$ be finite 
for some $x_0$, $0 \le x_0 \le 1$. Then 

$$\lim_{n \to \infty} n\,[B_n(x_0) - f(x_0)] = \tfrac12 f''(x_0)\, x_0 (1 - x_0).$$
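
A small numerical experiment illustrates Theorem 3.8. The Python sketch below, with the assumed test function $f = \exp$ and the assumed point $x_0 = .3$, compares $n[B_n(x_0) - f(x_0)]$ with the stated limit for a few values of $n$, and also checks the closed form for $B_n(x^2)$; the names `bernstein`, `scaled` and `square_gap` are illustrative.

```python
import math

def bernstein(f, n, x):
    """B_n(f, x) = sum_k C(n, k) x^k (1 - x)^(n - k) f(k / n)."""
    return sum(math.comb(n, k) * x**k * (1 - x) ** (n - k) * f(k / n)
               for k in range(n + 1))

x0 = 0.3
# (1/2) f''(x0) x0 (1 - x0) for f = exp, whose second derivative is exp itself.
limit = 0.5 * math.exp(x0) * x0 * (1 - x0)
scaled = {n: n * (bernstein(math.exp, n, x0) - math.exp(x0)) for n in (25, 100, 400)}

# B_n(x^2) - x^2 should equal x(1 - x)/n; here n = 25.
square_gap = bernstein(lambda t: t * t, 25, x0) - x0 * x0

print("limit:", limit)
for n, value in scaled.items():
    print(f"n={n:4d}  n*(B_n - f) = {value:.6f}")
print("B_25(x^2) - x^2 =", square_gap, " x(1 - x)/25 =", x0 * (1 - x0) / 25)
```

The error of $B_n$ therefore behaves like a constant times $1/n$, however smooth $f$ may be, which is the weakness referred to in the text.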


The weakness of the Bernstein polynomials as a means for approxima- 
tion is discussed at the end of Sec. 3.3. 


3.3. The Chebyshev Theory 


We have just seen that, given any function $f(x)$, continuous in $[a,b]$, 
and any $\epsilon > 0$, there is a polynomial $p = p_n$ such that 

$$|f(x) - p(x)| < \epsilon, \qquad a \le x \le b.$$

It is natural to ask, for any integer $n$, what is the polynomial of degree at 
most $n$ which is the best approximation to a given $f(x)$. Indeed, we 
should ask, first, Is there a best approximation? and, secondly, Is it 
unique? A third question is then relevant: How can one determine the 
extremal polynomial? 

We shall deal with a special case first: the approximation of $f(x) = 0$ 
by a $\tilde p_n(x)$ or, what is the same thing, the approximation of $x^n$ by a 
polynomial of lower degree. We shall establish the following result. 

Theorem 3.9. Let $\tilde p_n(x) = x^n + a_1 x^{n-1} + \cdots + a_n$. Let 

$$\mu(\tilde p_n) = \max_{-1 \le x \le 1} |\tilde p_n(x)|.$$

Then 

$$\mu(\tilde p_n) \ge \mu_n = 2^{1-n}.$$

There is equality if and only if 

$$\tilde p_n = \tilde T_n(x) = 2^{1-n} \cos (n \arccos x).$$

Proof. We note that $\tilde T_n = \pm 2^{1-n}$ alternately at the points 
$x_\nu = \cos \nu\pi/n$, $\nu = 0, 1, \ldots, n$. 

Suppose there is a $\tilde p_n(x)$, say $\pi_n(x)$, such that 

$$\mu(\pi_n) < \mu_n. \tag{3.7}$$

Consider 

$$r = \tilde T_n - \pi_n.$$

This is a polynomial of degree $n - 1$ which does not vanish identically. 
[If it did, then $\mu(\pi_n) = \mu(\tilde T_n) = \mu_n$, contradicting (3.7).] Now consider 
the values of $r$ at the points $x_\nu$. Clearly $r$ has alternate signs at 
these points. [Since $|\pi_n| < \mu_n$, the sign of $r$ is that of $\tilde T_n$.] By Bolzano’s 
theorem, $r$ must vanish between consecutive $x_\nu$, that is, $n$ times in all, 
which implies, since it is a polynomial of degree $n - 1$, that it vanishes 
identically. This is a contradiction. Hence (3.7) is false. Hence 

$$\mu(\tilde p_n) \ge \mu_n.$$

The proof of the equality results is left as an exercise. 

We shall now show that $\tilde T_n(x)$ has some compensation for having the 
smallest deviation from zero within $[-1,1]$: it is the largest such polynomial 
outside $[-1,1]$. 

Theorem 3.10. Let $p_n(x)$ be a polynomial of degree at most $n$. Let 
$M = \max_{-1 \le x \le 1} |p_n(x)|$. Then, for any real $\xi$, $|\xi| > 1$, we have $|p_n(\xi)| \le M\,|T_n(\xi)|$, 
where $T_n(x) = \cos (n \arccos x)$. 

Proof. Suppose the conclusion is false and consider 

$$r(x) = [p_n(\xi)\, T_n(x)/T_n(\xi)] - p_n(x).$$

This is a polynomial of degree at most $n$ and $r(\xi) = 0$. Further, since 
$|p_n(\xi)/T_n(\xi)| > M$ and since $T_n(\cos i\pi/n) = (-1)^i$, $i = 0, 1, \ldots, n$, it 
follows that 

$$r(\cos i\pi/n) = (-1)^i [p_n(\xi)/T_n(\xi)] - p_n(\cos i\pi/n)$$

has opposite signs for consecutive $i$, $|p_n(x)|$ being bounded by $M$ in 
$[-1,1]$. Hence $r$ has $n$ zeros inside $[-1,1]$ and another zero outside, at 
$x = \xi$. Hence $r(x) \equiv 0$. Hence 

$$p_n(x) = p_n(\xi)\, T_n(x)/T_n(\xi),$$

so that 

$$p_n(1) = p_n(\xi)/T_n(\xi),$$

and so 

$$|p_n(1)| > M,$$

a contradiction. 


Chebyshev Polynomials 


We insert here a collection of formulas concerning Chebyshev polynomials 
which will be required later. 

Definitions. There are several normalizations of these polynomials in 
current use, and care must be exercised. We define the Chebyshev polynomials 
of the first and second kind by 

$$T_n(x) = \cos (n \arccos x) = \frac{n}{2} \sum_{m=0}^{[n/2]} \frac{(-1)^m (n - m - 1)!}{m!\,(n - 2m)!}\, (2x)^{n-2m} = 2^{n-1} x^n - n\,2^{n-3} x^{n-2} + \cdots = 2^{n-1}\, \tilde T_n(x)$$

and 

$$U_n(x) = \frac{1}{n + 1}\, T'_{n+1}(x) = \frac{\sin [(n + 1) \arccos x]}{\sin (\arccos x)}$$
$$= \binom{n+1}{1} x^n - \binom{n+1}{3} x^{n-2} (1 - x^2) + \binom{n+1}{5} x^{n-4} (1 - x^2)^2 - \cdots$$
$$= \sum_{m=0}^{[n/2]} \frac{(-1)^m (n - m)!}{m!\,(n - 2m)!}\, (2x)^{n-2m}.$$

We therefore have 

$$\begin{array}{lll}
T_0 = 1 & \tilde T_0 = 1 & U_0 = 1 \\
T_1 = x & \tilde T_1 = x & U_1 = 2x \\
T_2 = 2x^2 - 1 & \tilde T_2 = x^2 - \tfrac12 & U_2 = 4x^2 - 1 \\
T_3 = 4x^3 - 3x & \tilde T_3 = x^3 - \tfrac34 x & U_3 = 8x^3 - 4x \\
T_4 = 8x^4 - 8x^2 + 1 & \tilde T_4 = x^4 - x^2 + \tfrac18 & U_4 = 16x^4 - 12x^2 + 1.
\end{array}$$

We shall use the following notation for the Chebyshev polynomials 
appropriate to the interval $[0,1]$: 

$$T_n^*(x) = T_n(2x - 1).$$

Recurrence Relation. Both $T_n$ and $U_n$ satisfy the recurrence relation 

$$p_n(x) = 2x\,p_{n-1}(x) - p_{n-2}(x)$$

with appropriate initial conditions. 

Zeros. $T_n(x) = 0$ for $x = x_\nu = \cos [(2\nu - 1)\pi/2n]$, $\nu = 1, 2, \ldots, n$, 
so that $\tilde T_n(x) = (x - x_1)(x - x_2) \cdots (x - x_n)$. 

Inequalities. $|T_n(x)| \le 1$ with equality if and only if $x = \cos \nu\pi/n$, 
$\nu = 0, 1, 2, \ldots, n$, and $|U_n(x)| \le n + 1$ with equality if and only if $x = \pm 1$. 

Values at Endpoints. $|T_n'(\pm 1)| = n^2$. 

Lagrangian Interpolation at the Chebyshev Abscissas. When, in the 
notation of Sec. 2.6, the nodes are the zeros $x_i$ of $T_n(x)$, we have 

$$l_i(x) = T_n(x)/(x - x_i)\,T_n'(x_i).$$

Now 

$$T_n'(x_i) = \frac{n \sin (n \arccos x_i)}{\sin (\arccos x_i)}$$

and with 

$$x_i = \cos [(2i - 1)\pi/2n]$$

we have 

$$\sin (n \arccos x_i) = (-1)^{i-1}, \qquad \sin (\arccos x_i) = \sqrt{1 - x_i^2}$$

so that 

$$l_i(x) = \frac{T_n(x)/(x - x_i)}{n (-1)^{i-1}/\sqrt{1 - x_i^2}}.$$

Existence of Polynomial of Best Approximation 


We now take up the general question of the existence, for each $n$, of a 
polynomial of degree at most $n$, of best approximation to a given continuous 
function. This is best handled from the point of view of elementary 
functional analysis. 

The following general problem of approximation has a meaning in a 
real normed linear vector space $X$: 

Given an $f \in X$ and a set of linearly independent elements $b_1, b_2, \ldots, b_n$, all in 
$X$, to find the linear combination $\sum \lambda_i b_i$ which is the best approximation (in the 
sense of the given norm) to $f$; that is, determine 

$$\min_{\{\lambda_i\}} \Big\| f - \sum \lambda_i b_i \Big\|.$$

We shall show that this problem always has a solution. By taking 
$X = C$, the set of functions $f(x)$ continuous on $[a,b]$, with 

$$\| f(x) \| = \max_{a \le x \le b} |f(x)|$$

and $b_i = x^{i-1}$, we obtain the result required. It is not possible to 
show that the solution to the general problem is unique. However, we 
can establish uniqueness if we insist on our space being strongly normed, 
that is, if 

$$\| x + y \| = \| x \| + \| y \|, \qquad x \ne 0, \quad y \ne 0,$$

only if $y = \alpha x$ ($\alpha > 0$). Observe that this result does not give us uniqueness 
in the Chebyshev case, for $C$ is not strongly normed; the latter fact 
follows from the remark that, if 

$$f(x_0) = \max |f(x)| \quad \text{and} \quad g(x_0) = \max |g(x)|,$$

then $\| f + g \| = \| f \| + \| g \|$, and we can certainly choose $f \ne \alpha g$. 
Uniqueness is established later (Theorem 3.14). 

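We begin by showing that $\phi(\lambda_1, \ldots, \lambda_n) = \|f - \sum \lambda_i b_i\|$ is a
continuous function of its arguments.

$$|\phi(\lambda') - \phi(\lambda)| = \Big|\, \|f - \sum \lambda_i' b_i\| - \|f - \sum \lambda_i b_i\| \,\Big|$$
$$\le \Big\| \sum (\lambda_i' - \lambda_i) b_i \Big\|$$
$$\le \sum |\lambda_i' - \lambda_i| \, \|b_i\|$$
$$\le \max_{1 \le i \le n} |\lambda_i' - \lambda_i| \sum \|b_i\|.$$

The $b_i$'s are fixed and so therefore is $\sum \|b_i\|$.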
It follows, in particular, that

$$\psi(\lambda) = \Big\| \sum \lambda_i b_i \Big\|$$

is a continuous function of $\lambda$. The shell

$$\lambda_1^2 + \lambda_2^2 + \cdots + \lambda_n^2 = 1$$

is a bounded, closed set in ordinary n-dimensional space, and on it the
continuous function $\psi$ must assume its minimum, $\mu$. Since $\psi \ge 0$,
$\mu \ge 0$. Since the $b_i$ are linearly independent, $\mu \ne 0$. Hence $\mu > 0$.
It follows, by the homogeneity of the norm, that for any $(\lambda_1, \ldots, \lambda_n)$,

$$\Big\| \sum \lambda_i b_i \Big\| \ge \mu \sqrt{\lambda_1^2 + \lambda_2^2 + \cdots + \lambda_n^2}. \qquad (3.8)$$


Let $\rho$ be the lower bound of $\phi(\lambda)$. Then $\rho \ge 0$, and we have to show that
this bound is attained, that is, that there is a $\lambda^*$ such that $\phi(\lambda^*) = \rho$.
We shall show that $\phi(\lambda)$ is large (specifically $\phi(\lambda) > \rho + 1$) when $\lambda$ is
large (specifically, when $\sqrt{\sum \lambda_i^2} > R = (\rho + 1 + \|f\|)/\mu$). Hence
the lower bound of $\phi(\lambda)$, for all $\lambda$, is the same as the lower bound when $\lambda$
is restricted by $\sqrt{\sum \lambda_i^2} \le R$, and we are now concerned with the lower
bound of a continuous function on a ball (solid sphere). Since the ball
is closed and bounded, the lower bound is attained, and the existence
of $\lambda^*$ is established. To establish the italicized statement above, we
observe that when $\lambda$ is restricted to $\sqrt{\sum \lambda_i^2} > R$ we have, by (3.8),

$$\phi(\lambda) \ge \Big\| \sum \lambda_i b_i \Big\| - \|f\| > \mu \cdot \frac{\rho + 1 + \|f\|}{\mu} - \|f\| = \rho + 1.$$


Note that we have made two appeals to the fact that a continuous func- 
tion defined on a bounded, closed subset of an ordinary n-dimensional 
space attains its lower bound. 

We have now established, in particular, the following result, due to
Borel.

Theorem 3.11. If f(x) is continuous in [a,b], then, for each n, there
exists a polynomial (of degree $\le n$) of best approximation.


Characterization and Uniqueness 


Our next task is the development of the qualitative Chebyshev criterion
for the polynomials of best approximation; we shall also obtain the
uniqueness. We let p(x) be a best approximation to f(x) on [a,b], among
all polynomials of degree n at most. We write

$$E_n = E_n(f) = \max_x |f(x) - p(x)| = \min_{p_n} \max_x |f(x) - p_n(x)|.$$

We assume that $E_n > 0$; otherwise f is itself a polynomial of degree n
at most. Consider

$$|f(x) - p(x)|.$$

This is continuous in [a,b] and so assumes its maximum $E_n$ in at least one
point in [a,b]. Such points are called e points (e = extremal). We
classify the e points into + points where the value of $f(x) - p(x)$ is $E_n$
and − points where this value is $-E_n$.

Theorem 3.12. There are always both + points and − points.

Proof. From the general properties of continuous functions there is
at least one e point. Suppose, for instance, that there were no + points.
This would mean that

$$\max_{a \le x \le b} [f(x) - p(x)] = E_n - 2h, \qquad h > 0.$$

Hence $E_n \ge p(x) - f(x) \ge -E_n + 2h$,

which implies

$$E_n - h \ge [p(x) - h] - f(x) \ge -E_n + h,$$

that is,

$$|(p(x) - h) - f(x)| \le E_n - h,$$

so that p(x) is not a polynomial of best approximation, a contradiction.

Theorem 3.13. There is a sequence of (n + 2) points in [a,b] which are
alternately + and − points.

Proof in linear case, $E_1 > 0$. We now know that there is a + point and
a − point. We shall show that there must be another e point in [a,b]
and that the signs of the e points alternate.

By uniform continuity of $f(x) - p(x) = E(x)$, we can divide [a,b] into
a set of intervals $I_i = [x_i, x_{i+1}]$ by a set of points $x_i$ with
$a = x_0 < x_1 < \cdots < x_m = b$ with the following property for each
i = 0, 1, 2, ..., m − 1:

If $x' \in I_i$, $x'' \in I_i$, then $|E(x') - E(x'')| < \tfrac12 E_1$.

Consider any $x'$ in the interval $I_+$ in which the + point, say $x_+$, lies.
Then

$$E(x_+) = E_1, \quad\text{and}\quad |E(x') - E_1| < \tfrac12 E_1,$$

so that $E(x') > \tfrac12 E_1 > 0$. Similarly, for $x''$ in the interval $I_-$ in which
the − point lies, we have $E(x'') < -\tfrac12 E_1 < 0$. These two intervals,
therefore, cannot overlap, or even abut, and so we can choose a point $z_1$
between them. Suppose that the + point is to the left. Then $(z_1 - x)$
has the same sign in the intervals $I_\pm$ as E(x) has. Let
$R = \max_{a \le x \le b} |z_1 - x|$.

Consider the "remaining" intervals $I_i$. We have, in these,

$$\max |E(x)| < E_1$$

and, there being a finite number of them, we have

$$\max_i \{ \max_{x \in I_i} |E(x)| \} = E^* < E_1.$$

We observe that $E^* > \tfrac12 E_1$. For the end points of the intervals $I_\pm$ are
also end points of the remaining intervals, and we have seen that the
values there satisfy $E(x') > \tfrac12 E_1$, $E(x'') < -\tfrac12 E_1$. Hence certainly
$E^* > \tfrac12 E_1$.

We consider, for $\epsilon > 0$,

$$\tilde{p}(x) = p(x) + \epsilon(z_1 - x),$$

which is linear. If we choose $\epsilon$ so small that

$$\epsilon R < E_1 - E^*,$$

which implies $\epsilon R < \tfrac12 E_1$, we see that the deviation of $\tilde{p}(x)$ from f(x) is
less than that of p(x). For in $I_+$, since $(z_1 - x) > 0$, and in $I_-$, since
$(z_1 - x) < 0$, and since $\epsilon R < \tfrac12 E_1$, the values of $\tilde{p}(x) - f(x)$ have the
same signs as those of $p(x) - f(x)$ but are reduced in absolute value. In
the remaining intervals, the absolute values may be increased, but they
cannot exceed

$$E^* + \epsilon R < E^* + E_1 - E^* = E_1.$$

Hence p is not a polynomial of best approximation. It follows that
there must be more than two e points. These cannot be disposed in the
order + + − or + − − by essentially the same argument as above: $z_1$ is
chosen between the + point and the next − point; it follows that we
must have an alternation + − +. [We cannot get a similar contradiction
in this case because we would have to take a quadratic $(z_1 - x)(z_2 - x)$,
which is illegal, to get a perturbation of the correct sign pattern.]
No essentially new ideas are needed to deal with the general case.

Theorem 3.14. The polynomial of best approximation (of degree $\le n$) is
unique.

Proof. Suppose there were two, $p'$, $p''$. We then would have

$$-E_n \le f - p' \le E_n,$$
$$-E_n \le f - p'' \le E_n.$$

Then $p''' = \tfrac12(p' + p'')$ would also be a polynomial of best approximation.
We can therefore construct, by Theorem 3.13, a series of (n + 2)
points, alternately + and −, for $p'''$.

Take a + point, $x_+$, for $p'''$. Then at $x_+$,

$$p''' - f = -E_n, \quad\text{that is,}\quad (p'' - f) + (p' - f) = -2E_n.$$

Since $|p' - f| \le E_n$, $|p'' - f| \le E_n$, it follows that $f - p' = f - p'' = E_n$;
that is, $p'$, $p''$ coincide at any + point. Similarly, $p'$, $p''$ coincide at any
− point. But there are (n + 2) ± points. Hence $p' = p''$.

Theorem 3.15. p(x) is the polynomial of best approximation (of degree
$\le n$) to f(x) if there exists a set of (n + 2) points, alternately + and − points.

Proof. Suppose $\max_{a \le x \le b} |f(x) - p(x)| = \mu$. Then $\mu \ge E_n$. We shall
show that $\mu = E_n$. Suppose not; that is, suppose $\mu > E_n$. Let q(x) be
the polynomial of best approximation, which exists and is unique by
Theorems 3.11 and 3.14. Then, since

$$q(x) - p(x) = q(x) - f(x) + f(x) - p(x),$$

the signs of $q(x) - p(x)$ and $f(x) - p(x)$ coincide at the (n + 2) extrema, because

$$|q - f| \le E_n, \qquad |f - p| = \mu > E_n.$$

Hence the polynomial (q − p), of degree $\le n$, alternates in sign at (n + 2) points
in [a,b] and so has (n + 1) zeros; it therefore vanishes identically. This
gives $\mu = \max |f - p| = \max |f - q| = E_n$, in contradiction with our
assumption that $\mu > E_n$. Hence $\mu = E_n$.


Alternative Treatment for the Prototype


Notice that in the proofs of Theorems 3.11 to 3.15 we have not made use
of Theorem 3.9. It is therefore appropriate to outline here an account
of Theorem 3.9, which is more motivated than that given earlier. We
use Theorem 3.15 in the case $f = x^n$, a = −1, b = 1, and when we are
approximating by polynomials of degree $\le n - 1$. Then the polynomial
$p_{n-1}(x)$ we require is characterized by having

$$P(x) = \pm\mu, \quad\text{where}\quad P = x^n - p_{n-1}(x),$$

alternately, at n + 1 = [(n − 1) + 2] points in [−1,1], where

$$\mu = \max_{-1 \le x \le 1} |P(x)|.$$

Now at each interior extremum, of which there are at least n − 1, we must
have $P' = 0$. But P is a polynomial of degree n. Hence all the zeros
of $P'$ are simple. Consider

$$q = y^2 - \mu^2, \qquad y = P(x).$$

Since $q' = 2yy'$ vanishes when $y' = 0$, all the interior extrema of y (and all
zeros of $y'$) are double zeros of $y^2 - \mu^2$. Hence $y^2 - \mu^2$ is divisible by
$y'^2$, and the residual factor is a quadratic. This quadratic must be
$M(1 - x^2)$ since we must have $|y(\pm 1)| = \mu$. Hence

$$y^2 - \mu^2 = M(1 - x^2)\,y'^2$$

and if we compare the leading coefficients, we find $M = -n^{-2}$. If we
write $\eta = y/\mu$, we obtain

$$\frac{n\,dx}{\sqrt{1 - x^2}} = \pm\frac{d\eta}{\sqrt{1 - \eta^2}}.$$

Integrating this relation carefully, we find

$$\eta = \cos(n \arccos x).$$


The basic ideas in this chapter can be applied in much more general 
situations—for example, to the case in which we consider weighted ap- 
proximations to continuous functions by rational functions of assigned 
degree, where the weight function is an arbitrary positive continuous 
function. On the other hand, we can consider approximations by fami- 
lies of functions which share with the polynomials the properties we need 
to draw the conclusion. 


Extrema under Different Constraints 


The problem of determining polynomials of minimum deviation where 
a coefficient, not the leading one, is fixed was studied by W. A. Markoff. 
Here again the Chebyshev polynomials turned out to be the critical ones. 
The result obtained is the following. 

Theorem 3.16. If the coefficient of $x^r$, r < n, in $p_n(x)$ is unity, then,
according to whether r, n have the same or different parity, we have

$$\max_{-1 \le x \le 1} |p_n(x)| \ge 1/|\tau_n^{(r)}| \quad\text{or}\quad \max_{-1 \le x \le 1} |p_n(x)| \ge 1/|\tau_{n-1}^{(r)}|,$$

where $\tau_n^{(r)}$ is the coefficient of $x^r$ in $T_n(x)$. There is strict inequality in the first
unless $p_n(x)$ is a multiple of $T_n(x)$ and in the second unless $p_n(x)$ is a multiple of
$T_{n-1}(x)$, with the following exceptions: r = 0 in both cases and r = 1, n = 2
in the second.

This problem suggests a series of problems solved by Zolotareff which
cannot be discussed by elementary methods. We quote one:

Among all polynomials of degree n, with leading coefficients unity and with their
second coefficient fixed, determine that with minimum deviation from zero in
[−1,1].

The solution of this and the related problems depends on the theory of 
elliptic functions and has important technical applications (see Piloty 


[+1]). 
Inconsistent Linear Systems 


We note here the recent studies on the construction of the Chebyshev 
approximation to the solution of an inconsistent system of linear equa- 
tions (see, e.g., Stiefel [39]). 


Given, say, m equations

$$\sum_j a_{ij} x_j + b_i = 0$$

in the n < m unknowns $x_1, \ldots, x_n$, we consider the m residuals

$$r_i = \sum_j a_{ij} x_j + b_i$$

and ask for the set of x which minimizes

$$\max_i |r_i|.$$

It has been shown that this problem is not essentially different from that
of solving a system of linear equations by the Gaussian elimination process
or of solving a "linear program," in the sense of Dantzig, by the
simplex method.


Construction of Polynomials of Best Approximation 


It is convenient to state our present position. We have now established
the existence and uniqueness of polynomials $p_n(x)$ of best approximation
to functions f(x) continuous in an interval [a,b]. We have
exhibited the $p_n(x)$ in the case f(x) = 0 in [−1,1]. So far no finite algorithm
for the construction of the extremal polynomials in a general case
has been discovered. Indeed the extent of the special cases in which
this has been done is remarkably small. We shall discuss some of these
briefly.

a. Formulas can be given for the cases n = 0 and n = 1 when f(x) is a
twice-differentiable convex (or concave) function.

When n = 0, if m, M are the lower and upper bounds of f(x) in [a,b],
it is clear that

$$p_0(x) = \tfrac12(m + M).$$

When n = 1, suppose f is twice-differentiable and convex, so that
$f''(x) > 0$. By the fundamental criterion, p(x) = Ax + B is the best
approximation if there are three points $x_1 < x_2 < x_3$ in [a,b] for which
$f(x) - p(x)$ attains its extreme values alternately. Hence $x_2$ is definitely
inside [a,b], and we can use the differential calculus to conclude that

$$f'(x_2) = A.$$

Now, since $f''(x) > 0$, $f'(x)$ is strictly increasing, so that $f'(x)$ can only
assume the value A once; this means that the derivative of f − p cannot
vanish at $x_1$ or at $x_3$, and so these extreme points must be end points:
$a = x_1$, $b = x_3$. Put $x_2 = c$. Then, using the equality of the extreme
values, we must have

$$f(a) - p(a) = f(b) - p(b) = -[f(c) - p(c)],$$

which are equations for A, B which give

$$A = [f(b) - f(a)]/(b - a),$$
$$B = \tfrac12[f(a) + f(c)] - (a + c)[f(b) - f(a)]/[2(b - a)].$$

The value of c is given by

$$f'(c) = [f(b) - f(a)]/(b - a).$$
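The n = 1 recipe can be tried numerically. The sketch below is a minimal illustration, assuming a strictly convex f whose derivative is supplied by the caller; the function name `best_linear_convex` and the use of bisection to solve f'(c) = A are choices made here, not taken from the text.

```python
import math

def best_linear_convex(f, fprime, a, b, iterations=200):
    """Return (A, B, c) for the best linear approximation A*x + B
    to a strictly convex f on [a, b], following the formulas for n = 1."""
    A = (f(b) - f(a)) / (b - a)
    # f' is strictly increasing, so f'(c) = A has one root; find it by bisection.
    lo, hi = a, b
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if fprime(mid) < A:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    B = 0.5 * (f(a) + f(c)) - 0.5 * (a + c) * A
    return A, B, c

# Example: f(x) = exp(x) on [0, 1]
A, B, c = best_linear_convex(math.exp, math.exp, 0.0, 1.0)
deviation = lambda x: math.exp(x) - (A * x + B)
```

For this example the deviation takes equal values at the two end points and the opposite value at c, which is the three-point alternation required by the criterion.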


b. We discuss in some detail the case of 1/(1 + x) in [0,1]. This
is essentially due to Chebyshev (1892) (see Hornecker [36]).

If we write $\theta = \arccos(2x - 1)$ and $c = 3 - 2\sqrt2$, then we find, by
elementary trigonometry, that

$$\sqrt2\,[\tfrac12 - cT_1^*(x) + c^2T_2^*(x) - c^3T_3^*(x) + \cdots] = (1 + x)^{-1} = 2(3 + \cos\theta)^{-1}. \qquad (3.9)$$

With the same notation, we can show that

$$\pi_n(x) = \sqrt2\,[\tfrac12 - cT_1^*(x) + \cdots + (-1)^{n-1}c^{n-1}T_{n-1}^*(x)] + (-1)^n \frac{\sqrt2\,c^n}{1 - c^2}\,T_n^*(x)$$
$$= \sqrt2\,\frac{\tfrac12(1 - c^2) - (-1)^n c^n(\cos n\theta + c\cos(n - 1)\theta)}{1 + 2c\cos\theta + c^2} + (-1)^n \frac{\sqrt2\,c^n\cos n\theta}{1 - c^2}$$

and then that

$$(1 + x)^{-1} - \pi_n(x) = (-1)^{n+1}\,\tfrac14 c^n \left\{ \frac{\cos(n + 1)\theta + 2c\cos n\theta + c^2\cos(n - 1)\theta}{1 + 2c\cos\theta + c^2} \right\}.$$

We next verify that the term in braces on the right of the last equation
can be written as

$$\{\cdots\} = \cos(n\theta + \phi)$$

where

$$\cos\phi = \frac{3\cos\theta + 1}{3 + \cos\theta}, \qquad \sin\phi = \frac{2\sqrt2\,\sin\theta}{3 + \cos\theta}.$$

We observe that $\phi$ is a function of $\theta$. As x goes from 0 to 1, 2x − 1 goes
from −1 to 1, $\theta$ goes from $\pi$ to 0, $\phi$ goes from $\pi$ to 0. Hence, as x goes
from 0 to 1, $\cos(n\theta + \phi)$ goes from $\cos(n + 1)\pi$ to $\cos 0$ and so has
(n + 2) extrema, alternately ±1. It follows from the Chebyshev criterion
that $\pi_n(x)$ must be the best approximation to $(1 + x)^{-1}$, and the
error is $\tfrac14 c^n$.
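The claimed error $\tfrac14 c^n$ can be checked numerically. The following Python sketch takes $\pi_n$ to be the expansion (3.9) truncated after $T_{n-1}^*$ plus the term $(-1)^n \sqrt2\, c^n T_n^*(x)/(1 - c^2)$; that explicit form is our reading of the formula above, and the function names and sampling grid are illustrative assumptions.

```python
import math

C = 3 - 2 * math.sqrt(2)

def T_star(k, x):
    # Shifted Chebyshev polynomial on [0, 1]: T_k^*(x) = cos(k arccos(2x - 1))
    t = max(-1.0, min(1.0, 2 * x - 1))
    return math.cos(k * math.acos(t))

def truncated_expansion(n, x):
    # sqrt(2) * [1/2 - c T_1^* + c^2 T_2^* - ... + (-c)^n T_n^*]
    total = 0.5
    for k in range(1, n + 1):
        total += (-C) ** k * T_star(k, x)
    return math.sqrt(2) * total

def best_approximation(n, x):
    # Truncate after T_{n-1}^*, then add the modified degree-n term.
    head = truncated_expansion(n - 1, x)
    return head + math.sqrt(2) * (-C) ** n / (1 - C * C) * T_star(n, x)

def max_error(approx, n, samples=2001):
    # Largest sampled deviation from 1/(1 + x) on an equally spaced grid of [0, 1]
    grid = (i / (samples - 1) for i in range(samples))
    return max(abs(1 / (1 + x) - approx(n, x)) for x in grid)
```

On the grid, `max_error(best_approximation, n)` should come out as c**n / 4, while `max_error(truncated_expansion, n)` is somewhat larger.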

The approximation to $(1 + x)^{-1}$ given by truncating the expansion
(3.9) has been discussed numerically by Lanczos [37]; it is very close to
the optimal $\pi_n(x)$. We shall pursue this question a little further. For
n = 1, the best possible polynomial is

$$\frac{1}{\sqrt2} + \frac14 - \frac12 x, \quad\text{that is,}\quad .9571 - .5x,$$

whereas that given by truncating the Chebyshev expansion is

$$\tfrac12(7\sqrt2 - 8) - (6\sqrt2 - 8)x, \quad\text{that is,}\quad .9497 - .4853x.$$

The corresponding errors are $\tfrac14 c = .0429$ and .0503. Take n = 6; then
$E_6 = \tfrac14 c^6 = 6.4 \times 10^{-6}$. A rough estimate gives

$$|(1 + x)^{-1} - \sqrt2\,[\tfrac12 - cT_1^*(x) + \cdots + c^6T_6^*(x)]| < 7.5 \times 10^{-6}.$$

Let us next consider the approximation given to $(1 + x)^{-1}$ by truncating
the power series $1 - x + x^2 - \cdots$. The remainder $x^{n+1}(1 + x)^{-1}$
depends on x, and the numbers of terms needed to get errors less than
$10^{-5}$ are

$$|x| \le .1,\ 5; \qquad |x| \le .5,\ 17; \qquad |x| \le .8,\ 49; \qquad |x| \le .9,\ 104.$$
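The term counts quoted for the power series can be reproduced with a few lines of Python. This is a sketch under an assumption of ours: the remainder after N terms, $x^N/(1 + x)$, is evaluated at the positive end x = x_max of each range, and the tolerance is $10^{-5}$. The function name is illustrative.

```python
def terms_needed(x_max, tol=1e-5):
    # Smallest N for which the remainder after N terms of 1 - x + x^2 - ...,
    # namely x_max**N / (1 + x_max), drops below tol.
    n_terms = 0
    while x_max ** n_terms / (1 + x_max) >= tol:
        n_terms += 1
    return n_terms

counts = [terms_needed(x) for x in (0.1, 0.5, 0.8, 0.9)]
```

Under these assumptions the four counts agree with the figures given for the ranges up to .1, .5, .8, and .9.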


Finally, consider the approximation given by the Bernstein polynomials.
From Theorem 3.8, it follows that we would need to take a
Bernstein polynomial of order about $10^5$ to get a comparable error.

c. We return briefly to the three sequences in Eqs. (3.3) to (3.5). The
first gives the best mean-square approximation (see Sec. 3.5 below), and
the third gives the actual polynomials of best approximation. Specifically,
for instance, $x^2 + \tfrac18$ is the best cubic approximation, for

$$x^2 + \tfrac18 - |x|$$

takes on the values

$$\tfrac18, \quad -\tfrac18, \quad \tfrac18, \quad -\tfrac18, \quad \tfrac18$$

at the points

$$-1, \quad -\tfrac12, \quad 0, \quad \tfrac12, \quad 1.$$


Current Developments 


There has been a recent revival of interest in the construction of algo- 
rithms for the determination of polynomials of best approximation (see, 
e.g., Hastings [33], Remez [32]). In practice it has been found that the 
truncation of an expansion in Chebyshev polynomials gives nearly opti- 
mal results. 

The question of determining the most efficient methods of interpola- 
tion has been taken up recently, and the relations between the Chebyshev 
ideas and the Comrie throwback have been investigated (see, e.g., Fox 
[24]). The use of “economized”’ polynomials has also been studied. 


3.4 The Theorems of the Markoffs 
Let $p_n(x)$ be a polynomial of degree n, such that

$$|p_n(x)| \le 1, \qquad -1 \le x \le 1.$$

Does this imply any restriction on the bounds of the derivatives of $p_n(x)$
for $-1 \le x \le 1$? This question was raised by the chemist Mendeleev
for the case of $p_2(x)$ and was answered by A. A. Markoff in 1890.

Theorem 3.17. If $|p_n(x)| \le 1$ for $-1 \le x \le 1$, then

$$|p_n'(x)| \le n^2, \qquad -1 \le x \le 1.$$

This result is a best possible one: there is equality if and only if

$$p_n(x) = \epsilon T_n(x), \qquad |\epsilon| = 1, \qquad x = \pm 1.$$

Repeated applications of this theorem give the following result.

Theorem 3.18. If $|p_n(x)| \le 1$ for $-1 \le x \le 1$, then

$$|p_n^{(k)}(x)| \le n^2(n - 1)^2 \cdots (n - k + 1)^2, \qquad -1 \le x \le 1,$$

for k = 1, 2, ..., n.

This result is not best possible if k > 1. The best possible result is the
following, which is due to W. A. Markoff.

Theorem 3.19. If $|p_n(x)| \le 1$ for $-1 \le x \le 1$, then

$$|p_n^{(k)}(x)| \le \frac{n^2(n^2 - 1^2) \cdots [n^2 - (k - 1)^2]}{1 \cdot 3 \cdot 5 \cdots (2k - 1)}, \qquad -1 \le x \le 1.$$
The critical polynomial is again $T_n(x)$. The original proof of this
theorem was rather complicated. A simple proof of a somewhat weaker
result has been given by W. W. Rogosinski. A comparatively simple
proof, based on Lagrange interpolation, and using some complex
variable ideas, has been given by Duffin and Schaeffer [43]. They
show that it is even sufficient to assume

$$|p_n(x)| \le 1 \quad\text{at}\quad x = \cos(\nu\pi/n), \qquad \nu = 0, 1, 2, \ldots, n.$$

To compare these two theorems consider the case of $T_4(x)$, for which
$T_4''(x) = 96x^2 - 16$. The exact bound of $T_4''$ is therefore 80, as given
by Theorem 3.19; the weaker Theorem 3.18 gives a bound of 144.
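The comparison for $T_4$ is easy to reproduce. The Python sketch below evaluates the bound obtained by repeated application of the first-derivative theorem, $n^2(n-1)^2\cdots(n-k+1)^2$, and the sharp bound $n^2(n^2-1^2)\cdots[n^2-(k-1)^2]/[1\cdot3\cdots(2k-1)]$; the function names are ours and the grid used to sample $T_4''$ is an arbitrary choice.

```python
from fractions import Fraction

def markoff_bound_repeated(n, k):
    # Theorem 3.18: n^2 (n-1)^2 ... (n-k+1)^2
    bound = 1
    for j in range(k):
        bound *= (n - j) ** 2
    return bound

def markoff_bound_sharp(n, k):
    # Theorem 3.19: product over j = 0..k-1 of (n^2 - j^2) / (2j + 1)
    value = Fraction(1)
    for j in range(k):
        value *= Fraction(n * n - j * j, 2 * j + 1)
    return value

def max_abs_T4_second_derivative(samples=1001):
    # T_4''(x) = 96 x^2 - 16, sampled on an equally spaced grid of [-1, 1]
    grid = (-1 + 2 * s / (samples - 1) for s in range(samples))
    return max(abs(96 * x * x - 16) for x in grid)
```

For n = 4, k = 2 the sharp bound coincides with the sampled maximum of $|T_4''|$, which is attained at the end points.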

We shall now outline a proof of the result of A. A. Markoff. We begin
with the following result.

Theorem 3.20. If $p_{n-1}(x)$ satisfies the inequality

$$(1 - x^2)^{1/2}\,|p_{n-1}(x)| \le 1, \qquad -1 \le x \le 1,$$

then

$$|p_{n-1}(x)| \le n, \qquad -1 \le x \le 1.$$

Proof. We use Lagrangian interpolation at the Chebyshev abscissas.
We have, identically,

$$p_{n-1}(x) = \frac1n \sum_{i=1}^{n} (-1)^{i-1}(1 - x_i^2)^{1/2}\,p_{n-1}(x_i)\,\frac{T_n(x)}{x - x_i}.$$

We consider separately the behavior of $p_{n-1}$ in the three subintervals

$$[-1, x_n], \qquad [x_n, x_1], \qquad [x_1, 1].$$

In the middle interval, since $x_1 = -x_n = \cos(\pi/2n)$, we have

$$(1 - x^2)^{1/2} \ge (1 - x_1^2)^{1/2} = \sin\frac{\pi}{2n} \ge \frac{2}{\pi}\cdot\frac{\pi}{2n} = \frac1n.$$

Hence, the hypothesis implies the conclusion immediately.
The two end intervals are treated similarly; we deal with $[x_1, 1]$ only.
In this interval, $T_n(x) = \cos(n \arccos x)$ increases from 0 to 1, and each
$x - x_i > 0$. Hence, our hypothesis gives

$$|p_{n-1}(x)| \le \frac1n \sum_{i=1}^{n} \frac{T_n(x)}{x - x_i} = \frac1n\,T_n'(x) \le n.$$

The result can be shown to be the best possible: there is equality if and
only if $p_{n-1}(x) = \gamma U_{n-1}(x)$, $|\gamma| = 1$, $x = \pm 1$.

A trigonometrical consequence of (3.4) is the following.

Theorem 3.21. Let $s(\phi) = a_1 \sin\phi + \cdots + a_n \sin n\phi$ satisfy

$$|s(\phi)| \le 1.$$

Then

$$|s(\phi)/\sin\phi| \le n.$$

Proof. Apply the preceding result to $p_{n-1}(x)$, where

$$p_{n-1}(\cos\phi) = s(\phi)/\sin\phi.$$

This result is the best possible: there is equality if and only if

$$s(\phi) = \pm\sin n\phi.$$


We next establish the second of the following two equivalent results.

Theorem 3.22. If $t_n(\phi)$ is a trigonometrical sum of order n and
$\max |t_n'(\phi)| = 1$, then $\max |t_n(\phi)| \ge n^{-1}$.

Theorem 3.23. If $t_n(\theta)$ is a trigonometrical sum of order n, then
$|t_n'(\theta)| \le n \max |t_n(\theta)|$.

Proof. We may assume $\max |t_n(\theta)| = 1$. Consider

$$s(\theta,\phi) = \tfrac12[t_n(\theta + \phi) - t_n(\theta - \phi)].$$

Since

$$\sin[r(\theta + \phi)] - \sin[r(\theta - \phi)] = 2\cos r\theta \sin r\phi,$$
$$\cos[r(\theta + \phi)] - \cos[r(\theta - \phi)] = -2\sin r\theta \sin r\phi,$$

the hypothesis of Theorem 3.21 is satisfied for any value of the parameter
$\theta$. It follows that

$$\left|\frac{s(\theta,\phi)}{\sin\phi}\right| \le n.$$

But

$$\frac{s(\theta,\phi)}{\sin\phi} = \frac{t_n(\theta + \phi) - t_n(\theta - \phi)}{2\phi}\cdot\frac{\phi}{\sin\phi},$$

and letting $\phi \to 0$, we find

$$|t_n'(\theta)| \le n$$

for any fixed $\theta$.

An immediate consequence is the following inequality, valid in the
interior of (−1,1).

Theorem 3.24. If $p_n(x)$ satisfies $|p_n(x)| \le 1$ for $-1 \le x \le 1$, then

$$|p_n'(x)| \le n/(1 - x^2)^{1/2}, \qquad -1 < x < 1.$$

We are now able to deal with the theorem of A. A. Markoff. The inequality
established in Theorem 3.24 can be written as

$$|n^{-1}p_n'(x)(1 - x^2)^{1/2}| \le 1,$$

and we can apply Theorem 3.20 to deduce

$$|n^{-1}p_n'(x)| \le n,$$

that is,

$$|p_n'(x)| \le n^2.$$

The equality cases can be traced by going over the argument carefully.


3.5 Orthogonal Polynomials


The purpose of this section is to indicate some of the properties of
orthogonal polynomials relevant in the present context. Accounts of the
general theory are available, for example, in Szego [14], Erdélyi [34],
and Sansone [31].

The basic fact is that, given any "reasonable" weight function w(x)
which is nonnegative in an interval [a,b] (which may be finite or infinite)
and whose integral over any subinterval of [a,b] is positive, we can
construct from the sequence of powers $1, x, x^2, \ldots$ a sequence of polynomials
$\pi_n(x)$, of exact degree n, which are normalized and orthogonal with
respect to w(x) in [a,b]. That is,

$$(\pi_m(x), \pi_n(x)) = \int_a^b \pi_m(x)\pi_n(x)w(x)\,dx = \delta(m,n). \qquad (3.10)$$

Here we use the inner product notation

$$(f(x), g(x)) = \int_a^b f(x)g(x)w(x)\,dx.$$

We may assume that the coefficient of $x^n$ in $\pi_n(x)$ is positive. This
orthogonalization can be carried out explicitly by the Gram-Schmidt
process, provided all the moments

$$\mu_n = \int_a^b x^n w(x)\,dx, \qquad n = 0, 1, 2, \ldots,$$

exist and are finite; this is the meaning of the word "reasonable" above.
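The proof of the existence of $\{\pi_n(x)\}$ is by induction. We take $\tilde{\pi}_0 = 1$
and then normalize this by taking $\pi_0 = \tilde{\pi}_0/\sqrt{(\tilde{\pi}_0, \tilde{\pi}_0)}$. Assume, therefore,
that $\pi_0, \pi_1, \ldots, \pi_n$ have been constructed and satisfy (3.10). Then
take

$$\tilde{\pi}_{n+1} = x^{n+1} - \sum_{r=0}^{n} (x^{n+1}, \pi_r)\,\pi_r.$$

Since $1, x, \ldots, x^{n+1}$ are linearly independent, $\tilde{\pi}_{n+1}$ cannot be zero. It
is easy to see that, for s = 0, 1, ..., n,

$$(\tilde{\pi}_{n+1}, \pi_s) = (x^{n+1}, \pi_s) - \sum_{r=0}^{n} (x^{n+1}, \pi_r)(\pi_r, \pi_s) = 0.$$

Also, $\tilde{\pi}_{n+1}$ is of exact degree n + 1. Hence we may take

$$\pi_{n+1} = \tilde{\pi}_{n+1}/\sqrt{(\tilde{\pi}_{n+1}, \tilde{\pi}_{n+1})}.$$

This completes the proof.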
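The inductive construction translates directly into a short program. The following Python sketch is a minimal illustration under assumptions of ours: the weight is w(x) = 1 on [−1, 1] (so the result is the normalized Legendre sequence), polynomials are stored as coefficient lists with the constant term first, and the inner product uses the exact moments of the monomials. The names `inner` and `orthonormal_polynomials` are illustrative.

```python
import math

def inner(p, q):
    # (p, q) = integral over [-1, 1] of p(x) q(x) dx, with w(x) = 1.
    # The integral of x^m over [-1, 1] is 2/(m + 1) for even m and 0 for odd m.
    total = 0.0
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if (i + j) % 2 == 0:
                total += pi * qj * 2.0 / (i + j + 1)
    return total

def orthonormal_polynomials(n_max):
    # Gram-Schmidt on 1, x, x^2, ...: subtract from x^n its components
    # along the polynomials already built, then normalize.
    basis = []
    for n in range(n_max + 1):
        monomial = [0.0] * n + [1.0]
        residual = monomial[:]
        for pr in basis:
            coeff = inner(monomial, pr)
            for k, c in enumerate(pr):
                residual[k] -= coeff * c
        norm = math.sqrt(inner(residual, residual))
        basis.append([c / norm for c in residual])
    return basis
```

Each residual keeps leading coefficient 1 before normalization, so the normalized polynomial has a positive leading coefficient, as assumed in the text.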
To establish the uniqueness we require the following lemma.

Lemma 3.25. If f(x) is continuous and nonnegative in [a,b] and if
$\int_a^b f(x)w(x)\,dx = 0$, then $f(x) \equiv 0$ in [a,b].

Proof. If this is false, then there is a c, $a \le c \le b$, such that f(c) > 0.
Hence, by continuity and nonnegativity, there is an interval [d,e],
including c, throughout which $f(x) > \tfrac12 f(c) > 0$. Hence

$$0 = \int_a^b f(x)w(x)\,dx \ge \int_d^e f(x)w(x)\,dx \ge \tfrac12 f(c)\int_d^e w(x)\,dx > 0,$$

a contradiction.
We shall now show that the sequence $\{\pi_n(x)\}$ is unique.
We note first that, if the $\pi_n(x) = k_n x^n + \cdots$ satisfy (3.10), then, since

$$1 = \int_a^b \pi_n^2(x)w(x)\,dx = \int_a^b \pi_n(x)\,k_n x^n w(x)\,dx,$$

we have

$$\int_a^b x^n \pi_n(x)w(x)\,dx = 1/k_n.$$

Suppose that $\pi_n'(x) = k_n' x^n + \cdots$, $\pi_n''(x) = k_n'' x^n + \cdots$ both satisfy
(3.10). Then

$$\int_a^b \pi_n'(x)\pi_n''(x)w(x)\,dx = k_n'/k_n'' = k_n''/k_n',$$

so that $k_n' = \pm k_n''$. Since both are positive, we have $k_n' = k_n''$, and so

$$\int_a^b \pi_n'(x)\pi_n''(x)w(x)\,dx = 1.$$

We now note that

$$\int_a^b [\pi_n'(x) - \pi_n''(x)]^2 w(x)\,dx = 1 - 2 + 1 = 0.$$

Hence, by our lemma,

$$\pi_n'(x) \equiv \pi_n''(x).$$
We next establish the following result.

Theorem 3.26. All the zeros of $\pi_n(x)$ are real and distinct and lie in the
interval [a,b].

Proof. Consider the zeros of $\pi_n(x)$. Since all the coefficients are
real, complex zeros occur in conjugate pairs, $\alpha \pm i\beta$. The corresponding
factors of $\pi_n(x)$ can be combined as $(x - \alpha)^2 + \beta^2$, and this is positive
for all x. If there are any (real) zeros of even multiplicity, the
corresponding factor $(x - \alpha_r)^{2m_r}$ is nonnegative for all x. The residual,
real zeros are of odd multiplicity; denote by $b_1, \ldots, b_k$ those which lie in
[a,b]. Clearly $k \le n$, and our result follows if k = n.

Observe now that

$$(x - b_1)(x - b_2) \cdots (x - b_k)\,\pi_n(x)$$

is of constant sign in [a,b]. It follows from Lemma 3.25 that

$$\int_a^b (x - b_1)(x - b_2) \cdots (x - b_k)\,\pi_n(x)w(x)\,dx \ne 0.$$

This can happen only if k = n, for if k < n, the integral vanishes by
orthogonality.

We now quote without proof two further properties of orthogonal
polynomials.

Theorem 3.27. Any three consecutive orthogonal polynomials are connected
by a linear relation of the form

$$\pi_{n+1}(x) = (A_n x + B_n)\pi_n(x) - C_n\pi_{n-1}(x), \qquad n = 1, 2, \ldots.$$

Theorem 3.28 (interlacing). If $z_1 < z_2 < \cdots < z_n$ are the zeros of
$\pi_n(x)$, and if $Z_1 < Z_2 < \cdots < Z_{n+1}$ are the zeros of $\pi_{n+1}(x)$, then

$$a < Z_1 < z_1 < Z_2 < z_2 < \cdots < Z_n < z_n < Z_{n+1} < b.$$


We now want to discuss some extremal properties of orthogonal
expansions. It is convenient to discuss this in a somewhat general setting.*
Suppose given a set (or space) of functions provided with an inner product
( , ). Two functions, f, g, of this space are said to be orthogonal
if (f,g) = 0. A sequence of functions $\{\phi_n\}$ is said to be an orthogonal
system if $(\phi_n, \phi_m) = 0$, $n \ne m$; it is said to be a normal orthogonal system
if, in addition, $(\phi_n, \phi_n) = 1$. Given such a normal orthogonal system,
we can ask whether it is possible to represent an arbitrary function f as a
linear combination of the $\phi_n$:

$$f = \sum a_n\phi_n.$$

Proceeding formally, on this assumption, it is easy to calculate the $a_n$.
Indeed,

$$(f, \phi_n) = \Big(\sum a_r\phi_r, \phi_n\Big) = \sum a_r(\phi_r, \phi_n) = \sum a_r\,\delta(r,n) = a_n.$$

These $a_n$ are called the Fourier coefficients of f with respect to $\{\phi_n\}$. The
fact that we can calculate the Fourier coefficients in a special case gives
us no guarantee about the convergence of $\sum a_n\phi_n$ or that its sum is f,
should it be convergent.

It is clear that, if there were nontrivial functions f such that

$$(f, \phi_n) = 0 \quad\text{for all } n,$$

then different functions could have the same Fourier coefficients. We
call a sequence $\{\phi_n\}$ complete if this cannot happen. For instance, the
sequence sin x, sin 2x, ... is not complete in $(0, 2\pi)$, for all the Fourier
coefficients of cos x with respect to this sequence vanish. [It is, however,
complete in $(0, \pi)$.]

The formal series $\sum a_r\phi_r$ is called the Fourier series of f with respect to
$\{\phi_n\}$. The partial sums, $f_n = \sum_{r=0}^{n} a_r\phi_r$, of this are called the Fourier
polynomials of f with respect to $\{\phi_n\}$; we call any finite sum
$\sum_{r=0}^{n} c_r\phi_r$ a trigonometrical polynomial of degree n.

* A natural setting for this theory is the space $L^2$. It is not our purpose here to
develop the appropriate theory rigorously and the proofs we give are formal.

Theorem 3.29. Among all the trigonometrical polynomials of degree n,
that which gives the best mean-square approximation to f(x) is the Fourier
polynomial.

Proof. We want to minimize, with respect to the $c_r$,

$$\int \Big[f(x) - \sum_{r=0}^{n} c_r\phi_r(x)\Big]^2 dx.$$

(We have taken a simple, weightless case!) We have

$$\int \Big[f(x) - \sum c_r\phi_r(x)\Big]^2 dx = \int f^2(x)\,dx - 2\sum c_r \int f(x)\phi_r(x)\,dx + \sum_r \sum_s c_r c_s \int \phi_r(x)\phi_s(x)\,dx$$
$$= \int f^2(x)\,dx - 2\sum a_r c_r + \sum c_r^2$$
$$= \int f^2(x)\,dx + \sum (c_r - a_r)^2 - \sum a_r^2.$$

The right-hand side is not less than

$$\int f^2(x)\,dx - \sum a_r^2 = \int \Big[f(x) - \sum_{r=0}^{n} a_r\phi_r(x)\Big]^2 dx$$

no matter what the $c_r$ are, and there is equality if (and only if) $c_r = a_r$.
This is the result required. The same argument applies in the general
context to show that the Fourier polynomial is the best approximation
in the sense of the norm induced by the inner product.
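Theorem 3.30. Among all polynomials $p_n(x)$ of degree n and leading
coefficient unity, that which minimizes $\int_a^b p_n(x)^2 w(x)\,dx$ is the $\tilde{\pi}_n(x)$, where the
$\{\tilde{\pi}_n(x)\}$ are orthogonal with respect to w(x) over [a,b].

Proof. We can write

$$p_n(x) = \tilde{\pi}_n(x) + \sum_{r=0}^{n-1} a_r\pi_r(x),$$

where the $\pi_r(x)$ are the normalized polynomials of (3.10). Hence

$$\int_a^b p_n^2(x)w(x)\,dx = \int_a^b \tilde{\pi}_n^2(x)w(x)\,dx + 2\sum_{r=0}^{n-1} a_r \int_a^b \tilde{\pi}_n(x)\pi_r(x)w(x)\,dx + \sum_r \sum_s a_r a_s \int_a^b \pi_r(x)\pi_s(x)w(x)\,dx$$
$$= \int_a^b \tilde{\pi}_n^2(x)w(x)\,dx + \sum_{r=0}^{n-1} a_r^2,$$

which is least when every $a_r = 0$.

In words, this theorem says that $\tilde{\pi}_n(x)$ is the best mean-square approximation
to zero by polynomials of degree n and leading coefficient unity.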

We shall now discuss a little further the concept of completeness
introduced earlier.

Theorem 3.31. If the range [a,b] is finite, the corresponding sequence
$\{\pi_n(x)\}$ is complete (for continuous functions).

Proof. Suppose f(x) is continuous in [a,b]. Take any $\epsilon > 0$.
Then, by the Weierstrass theorem, there is a polynomial $p(x) = p_m(x)$
such that

$$|f(x) - p(x)| < \epsilon, \qquad a \le x \le b.$$

Now, if we take n larger than the degree of $p_m(x)$, we have, by Theorem
3.29,

$$\int_a^b [f(x) - f_n(x)]^2 w(x)\,dx \le \int_a^b [f(x) - p(x)]^2 w(x)\,dx$$

and so

$$\int_a^b [f(x) - f_n(x)]^2 w(x)\,dx = \int_a^b f^2(x)w(x)\,dx - \sum_{r=0}^{n} a_r^2 \le \mu_0\epsilon^2.$$

Hence

$$0 \le \int_a^b f^2(x)w(x)\,dx - \sum_{r=0}^{n} a_r^2 \le \mu_0\epsilon^2.$$

This means, $\epsilon$ being arbitrary, that

$$\int_a^b f^2(x)w(x)\,dx = \sum_{r=0}^{\infty} a_r^2. \qquad (3.11)$$

This equality is Parseval's theorem.

To show that $\{\pi_n(x)\}$ is complete for continuous functions, we have to
show that, if all the Fourier coefficients of any continuous f(x) vanish,
then so does f. But this is now evident from (3.11) and Lemma 3.25.

Theorem 3.32. If f(x) is continuous in a finite interval [a,b] and if all its
moments

$$\mu_n = \int_a^b x^n f(x)w(x)\,dx = 0, \qquad n = 0, 1, 2, \ldots,$$

then f(x) is identically zero.

This "finite-moment theorem" is equivalent to Theorem 3.31. It is
important to note that we cannot extend this to an infinite integral.
This is shown, for example, by the following example, due to T. J.
Stieltjes:

$$f(x) = \exp(-x^{1/4}) \sin x^{1/4}.$$

3.6 Interpolation and Interpolation Schemes

The bases of the theory of polynomial interpolation are presented in
Chap. 2. We now discuss some extremal problems which arise in that
theory and which are now accessible.

The classical error estimates for n-point Lagrangian and Hermitian
interpolation are*

$$f(x) - L(f,x) = \frac{f^{(n)}(\xi)}{n!} \prod_{i=1}^{n} (x - x_i),$$

$$f(x) - H(f,x) = \frac{f^{(2n)}(\xi)}{(2n)!} \prod_{i=1}^{n} (x - x_i)^2,$$

where it is assumed that the derivatives which occur exist in an interval
including the nodes $x_1, x_2, \ldots, x_n$ and the current point x and where $\xi$,
which depends on x, lies in this interval.

(a) Suppose we are going to interpolate in [−1,1], using an n-point
Lagrangian formula, for a function f(x) whose nth derivative is
bounded in [−1,1]. What is the best choice of $x_1, x_2, \ldots, x_n$? From
the above representation of the error, it follows that this choice is that
which minimizes the maximum of

$$|(x - x_1)(x - x_2) \cdots (x - x_n)|.$$

This means that the $x_i$ should be taken as the zeros of $T_n(x)$, that is,

$$x_i = \cos((2i - 1)\pi/2n), \qquad i = 1, 2, \ldots, n.$$

(b) Suppose that we are going to interpolate in [−1,1], using an n-point
Hermite formula, for a function f(x) whose 2nth derivative is
bounded in [−1,1]. What is the best choice of $x_1, x_2, \ldots, x_n$ if we
measure the error in the $L_1$ norm? That is, we want to minimize

$$\int_{-1}^{1} (x - x_1)^2 \cdots (x - x_n)^2\,dx.$$

It follows from Theorem 3.30 that the $x_i$ should be taken as the zeros of
the Legendre polynomial $P_n(x)$.
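The claim in (a) can be illustrated numerically by comparing the node polynomial $|(x - x_1)\cdots(x - x_n)|$ for the Chebyshev zeros with the one for equally spaced nodes. The Python sketch below is an illustration only; the choice n = 8, the sampling grid, and the equally spaced node set (end points included) are assumptions made here.

```python
import math

def node_polynomial_max(nodes, samples=4001):
    # Largest sampled value of |(x - x_1)...(x - x_n)| on [-1, 1]
    largest = 0.0
    for s in range(samples):
        x = -1 + 2 * s / (samples - 1)
        product = 1.0
        for xi in nodes:
            product *= (x - xi)
        largest = max(largest, abs(product))
    return largest

n = 8
chebyshev_nodes = [math.cos((2 * i - 1) * math.pi / (2 * n)) for i in range(1, n + 1)]
equispaced_nodes = [-1 + 2 * i / (n - 1) for i in range(n)]

chebyshev_max = node_polynomial_max(chebyshev_nodes)    # equals 2**(1 - n)
equispaced_max = node_polynomial_max(equispaced_nodes)  # noticeably larger
```

With the Chebyshev zeros the node polynomial is $T_n(x)/2^{n-1}$, so its maximum modulus is $2^{1-n}$; the equally spaced nodes give a larger value, concentrated near the ends of the interval.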


* Note that the notation here differs slightly from that in Chap. 2. The nodes
there denoted by $x_0, x_1, \ldots, x_{n-1}$ are here denoted by $x_1, x_2, \ldots, x_n$.

The way in which we try (n + 1)-point interpolation if n-point interpolation
is not satisfactory suggests a study of a sequence of (Lagrangian)
interpolation schemes defined by a sequence of nodes:

$$x_1^{(1)}; \quad x_1^{(2)}, x_2^{(2)}; \quad x_1^{(3)}, x_2^{(3)}, x_3^{(3)}; \quad \ldots,$$

all in a fixed interval [a,b]. Suppose we pick on a special function f(x),
defined in [a,b], and we consider for n = 1, 2, ... the Lagrangian interpolant
of degree n − 1, $L_{n-1}(f,x)$, with nodes $x_1^{(n)}, x_2^{(n)}, \ldots, x_n^{(n)}$.
We can ask whether

$$L_{n-1}(f,x) \to f(x) \quad\text{as}\quad n \to \infty \qquad (3.12)$$

for any or all of the points $x \in [a,b]$. There are many questions and
some solutions to problems of this kind. We prove two positive results.

Theorem 3.33. If f(x) is a polynomial, (3.12) is true for all x.

Proof. If f(x) has degree N, then for n > N, $L_n(f,x) = f(x)$, and the
sequence $\{L_n(f,x)\}$ is stationary and therefore convergent.

Theorem 3.34. If f(x) is any continuous function, we can choose the
$x_i^{(n)}$ so that (3.12) will be true for any $x \in [a,b]$.

Proof. By the fundamental existence theorem 3.11, for each n, there
is a polynomial $\pi_n(x)$ of best approximation to f(x). We know that the
difference $\pi_n(x) - f(x)$ vanishes n + 1 times in [a,b]; these zeros we take
to be the $x_i^{(n+1)}$. This means that the corresponding $L_n(f,x)$ is forced
to be $\pi_n(x)$. But we know that $\{\pi_n(x)\}$ converges uniformly to f(x).

Another application of the fundamental theorem is to the proof of the
following result of Erdős and Turán (1937), concerning interpolation at
the zeros of the orthogonal polynomials $\pi_n(x)$, where orthogonality is
with respect to $w(x)$ in $[a,b]$. Denote by $L_n(f,x)$ the corresponding
Lagrangian interpolant.

Theorem 3.35. For every continuous $f(x)$, we have

$$\lim_{n \to \infty} \int_a^b w(x)\,[L_n(f,x) - f(x)]^2 \, dx = 0.$$

This can be interpreted to mean that (3.12) is true in a mean-square
sense.
We now discuss some negative results. First of all, it is tempting to
suppose that, when the nodes are equally spaced, we should have

$$L_n(f,x) \to f(x)$$

uniformly in $[a,b]$ and so have another proof of the Weierstrass theorem.
This is false even for very simple functions.

For instance, Bernstein, using real variable methods, established the
following result.

Theorem 3.36. If $f(x) = |x|$ and the $x_i^{(n)}$ are equally spaced in
$[-1,1]$, there is convergence only at $0, \pm 1$.
The following result is due to Runge.

Theorem 3.37. If $f(x) = (1 + x^2)^{-1}$ and the $x_i^{(n)}$ are equally and
symmetrically spaced in $[-5,5]$, then the sequence $\{L_n(f,x)\}$ is divergent if
$|x| > 3.6334\ldots$.
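The divergence asserted in Theorem 3.37 is easy to observe numerically. The following Python sketch is illustrative only: the helper names `equispaced_nodes`, `lagrange_eval`, and `interpolation_error`, the choice of nodes that include both end points, and the sample points 0.3 and 4.9 are assumptions made here, not part of the text. It evaluates the Lagrangian interpolant of $(1 + x^2)^{-1}$ once at a point inside the critical abscissa and once outside it.

```python
def f(x):
    return 1.0 / (1.0 + x * x)

def equispaced_nodes(n, a=-5.0, b=5.0):
    # n equally spaced nodes, end points included, symmetric about 0
    return [a + (b - a) * r / (n - 1) for r in range(n)]

def lagrange_eval(nodes, values, x):
    # Direct evaluation of sum_i values[i] * l_i(x)
    total = 0.0
    for i, xi in enumerate(nodes):
        li = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                li *= (x - xj) / (xi - xj)
        total += values[i] * li
    return total

def interpolation_error(n, x):
    nodes = equispaced_nodes(n)
    values = [f(t) for t in nodes]
    return abs(f(x) - lagrange_eval(nodes, values, x))

if __name__ == "__main__":
    for n in (11, 21, 41):
        print(n, interpolation_error(n, 0.3), interpolation_error(n, 4.9))
```

With these choices the error at 0.3 shrinks as the number of nodes grows, while the error at 4.9 grows rapidly.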

We shall outline a proof of this, using complex variable theory. 

We rely on the following relation (cf. Sec. 2.6):

$$\frac{1}{2\pi i} \oint_{\mathcal{C}} \frac{f(z)\,dz}{(z - x)\,\omega_n(z)} = \frac{f(x) - L_n(f,x)}{\omega_n(x)} - \sum_s \frac{R_s}{(x - p_s)\,\omega_n(p_s)},$$

where $f(z)$ is regular in and on a closed curve $\mathcal{C}$ except at a finite set of
poles $p_s$, where it has residues $R_s$, where the distinct nodes $a_1, \ldots, a_n$ lie
in the interior of $\mathcal{C}$, where $x \ne a_r$ ($r = 1, 2, \ldots, n$), and where
$\omega_n(x) = (x - a_1)(x - a_2) \cdots (x - a_n)$.

Let us apply this to the case $f(x) = (1 + x^2)^{-1}$. Then, if $\mathcal{C}$ includes
$\pm i$, the integral is zero, for we can deform $\mathcal{C}$ to an arbitrarily large circle
$|z| = R$ on which $|f(z)| = O(R^{-2})$, which implies that the integral is
$o(1)$ and therefore zero. Combining the terms corresponding to the
poles $\pm i$, we find, the nodes being assumed symmetrical with respect to
the origin,

$$f(x) - L_n(f,x) = \frac{1}{1 + x^2}\,\frac{\omega_n(x)}{\omega_n(i)}.$$


Convergence, therefore, depends on the behavior of $\omega_n(x)/\omega_n(i)$. We
observe that, if $r_i = |x - a_i|$, then

$$\log |\omega_n(x)|^{1/n} = \frac{1}{n}\,(\log r_1 + \cdots + \log r_n).$$

Suppose further that $a_r = -5 + 10(r - \tfrac{1}{2})(n - 1)^{-1}$ for $r = 1, 2, \ldots, n$.
Then the above expression is a Riemann sum for

$$I(x) = \frac{1}{10} \int_{-5}^{+5} \log |x - \xi| \, d\xi,$$

which can be integrated explicitly.

The curves $I(x) = k$ are equipotentials for the logarithmic potential of
a uniform mass distribution on $[-5,5]$ and are ovals expanding from the
segment $[-5,5]$ as $k$ increases. If $x$ and $z$ lie on different equipotentials
$k_1, k_2$ with $k_1 > k_2$, so that $I(x) > I(z)$, then for sufficiently large $n$ we have

$$\log |\omega_n(x)|^{1/n} - \log |\omega_n(z)|^{1/n} > \tfrac{1}{2}(k_1 - k_2).$$

This means that

$$\frac{|\omega_n(x)|}{|\omega_n(z)|} > \exp \tfrac{1}{2} n (k_1 - k_2) \to \infty.$$
If we evaluate $I(x)$ and solve the transcendental equation

$$I(x) = I(i),$$

we find $x = 3.6334\ldots$.
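The critical abscissa can be checked with a few lines of Python. The sketch below assumes the closed forms obtained by elementary integration of $\frac{1}{10}\int_{-5}^{5} \log|x - \xi|\,d\xi$, namely $I(x) = \frac{1}{10}[(5 + x)\log(5 + x) + (5 - x)\log(5 - x)] - 1$ for real $|x| < 5$ and $I(i) = \frac{1}{2}\log 26 - 1 + \frac{1}{5}\arctan 5$; the function names are hypothetical. Since $I(x)$ increases on $(0,5)$, plain bisection suffices.

```python
import math

def I_real(x):
    # I(x) for real x with |x| < 5 (closed form, assumed as in the lead-in)
    return ((5 + x) * math.log(5 + x) + (5 - x) * math.log(5 - x)) / 10.0 - 1.0

def I_at_i():
    # I(i) = (1/10) * integral of log sqrt(1 + t^2) over [-5, 5]
    return 0.5 * math.log(26.0) - 1.0 + math.atan(5.0) / 5.0

def critical_abscissa(tol=1e-12):
    # Bisect I_real(x) - I(i) = 0 on (0, 5), where I_real is increasing
    target = I_at_i()
    lo, hi = 0.0, 4.999
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if I_real(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(critical_abscissa())
```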
We conclude by listing, without proof, two further negative results.

Theorem 3.38. No matter what the $x_i^{(n)}$ are, there are continuous
functions $f(x)$ for which (3.12) is not true uniformly in $[a,b]$.

Theorem 3.39. If the $x_i^{(n)}$ are the Chebyshev abscissas, there are
continuous functions for which (3.12) is always false.


3.7 Approximate Quadratures and Quadrature Schemes 


The idea of an approximate quadrature of the form

$$I = \int_a^b f(x)\,dx \approx Q = \sum \lambda_i f(x_i)$$

is a very ancient one, and various practical aspects of it are covered in
Chap. 2. Here we concentrate on some of the more theoretical aspects.

An obvious approach to this problem is the following: an approximation
to the integral of a function is the integral of an approximation to
the function. Application of this idea leads to Lagrangian quadratures.
If

$$f(x) \approx L_n(f,x) = \sum f(x_i)\,l_i(x),$$

then

$$I \approx Q = \int_a^b \sum f(x_i)\,l_i(x)\,dx = \sum f(x_i) \int_a^b l_i(x)\,dx = \sum \lambda_i f(x_i), \quad \text{say}.$$

The error estimate given earlier is

$$|I - Q| \le \frac{\max |f^{(n+1)}(x)|}{(n + 1)!} \int_a^b |(x - x_0)(x - x_1) \cdots (x - x_n)|\,dx.$$


This result shows that any $(n + 1)$-point Lagrangian quadrature for a
polynomial of degree $n$ is exact. This is also true in the weighted case.
As in Sec. 3.6, we can ask what is the best choice of the nodes $x_1, x_2,
\ldots, x_n$? One answer to this was given by Korkine and Zolotareff in
1873, based on the above estimate: the Lagrangian quadrature based on
the zeros of $U_n(x)$ is a best possible one. This follows from the following
theorem.

Theorem 3.40. The minimum value of $\int_{-1}^{+1} |\tilde{p}_n(x)|\,dx$, over all
polynomials $\tilde{p}_n(x)$ of degree $n$, with leading coefficient unity, is $2^{1-n}$,
and this is attained only by $\tilde{p}_n(x) = \tilde{U}_n(x)$.
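Proof. It will be enough to show that

$$\int_{-1}^{+1} x^r \operatorname{sign} U_n(x)\,dx = \begin{cases} 0, & \text{if } r = 0, 1, 2, \ldots, n - 1, \\ 2^{1-n}, & \text{if } r = n. \end{cases} \tag{3.13}$$

This will imply that

$$\int_{-1}^{+1} \tilde{p}_n(x) \operatorname{sign} U_n(x)\,dx = 2^{1-n}$$

and therefore, invariably, that

$$\int_{-1}^{+1} |\tilde{p}_n(x)|\,dx \ge 2^{1-n}$$

and also that

$$\int_{-1}^{+1} |\tilde{U}_n(x)|\,dx = 2^{1-n}.$$

It follows that, if $\int_{-1}^{+1} |\tilde{p}_n(x)|\,dx = 2^{1-n}$, we would have

$$\int_{-1}^{+1} |\tilde{p}_n(x)|\,[1 - \operatorname{sign} \tilde{p}_n(x) \operatorname{sign} U_n(x)]\,dx = 0.$$

This implies that

$$\operatorname{sign} \tilde{p}_n(x) \operatorname{sign} U_n(x) = 1,$$

so that the zeros of $\tilde{p}_n(x)$, $\tilde{U}_n(x)$ must coincide; however, each has leading
coefficient unity, and we must have $\tilde{p}_n(x) = \tilde{U}_n(x)$, establishing the
uniqueness of the extremal polynomial.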

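We establish (3.13) as follows. Putting $\theta = \arccos x$, we have to
evaluate

$$I_r = \int_0^{\pi} \cos^r \theta \, \operatorname{sign}\!\left(\frac{\sin (n + 1)\theta}{\sin \theta}\right) \sin \theta \, d\theta.$$

We can omit the $\sin \theta$ in the argument of sign. Now

$$\operatorname{sign}(\sin (n + 1)\theta) = (-1)^k \quad \text{if } k\pi < (n + 1)\theta < (k + 1)\pi.$$

Hence

$$I_r = \sum (-1)^k \int_{k\phi}^{(k+1)\phi} \cos^r \theta \sin \theta \, d\theta,$$

where the summation is for $k = 0, 1, \ldots, n$ and where $\phi = \pi/(n + 1)$.
Thus

$$(r + 1) I_r = \sum (-1)^{k+1} [\cos^{r+1} (k + 1)\phi - \cos^{r+1} k\phi].$$

The evaluation of $I_r = 2^{1-n}\,\delta(r,n)$ now follows by elementary
trigonometry.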
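The closing evaluation can be checked numerically before attempting the trigonometry. The Python sketch below (the function name `I_r` is ours) sums the bracketed differences of $\cos^{r+1}$ at the points $k\phi$, $\phi = \pi/(n + 1)$, exactly as in the last displayed formula, and prints the values for one choice of $n$.

```python
import math

def I_r(r, n):
    # (r + 1) I_r = sum_{k=0}^{n} (-1)^(k+1) [cos^(r+1)((k+1) phi) - cos^(r+1)(k phi)]
    phi = math.pi / (n + 1)
    total = 0.0
    for k in range(n + 1):
        upper = math.cos((k + 1) * phi) ** (r + 1)
        lower = math.cos(k * phi) ** (r + 1)
        total += (-1) ** (k + 1) * (upper - lower)
    return total / (r + 1)

if __name__ == "__main__":
    n = 5
    for r in range(n + 1):
        print(r, I_r(r, n))
```

For each $n$ tried, the values for $r < n$ vanish to rounding error and the value for $r = n$ agrees with $2^{1-n}$, in line with (3.13).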

A reasonable question is whether we can get better quadrature formulas
if we do not insist on their being Lagrangian. In a Lagrangian quadrature,
assigning the nodes determines the multipliers $\lambda_i = \int_a^b l_i(x)\,dx$. It
is plausible that we can choose the $2n$ quantities $\lambda_i$, $x_i$ in such a way as to
satisfy the $2n$ equations

$$\int_a^b x^r\,dx = \sum \lambda_i x_i^r, \quad r = 0, 1, \ldots, 2n - 1,$$

which would imply that the quadrature is exact, that is,

$$\int_a^b p_{2n-1}(x)\,dx = \sum \lambda_i\,p_{2n-1}(x_i)$$

for any polynomial of degree $2n - 1$ at most. This is indeed the case;
moreover, we can find such quadratures for any fixed positive weight
function $w(x)$. In fact, we have the following result.

Theorem 3.41. Let $\{\pi_n(x)\}$ denote the polynomials orthogonal with respect
to $w(x)$ on $[a,b]$. Then, if $x_i$ are the zeros of $\pi_n(x)$, we have

$$\int_a^b p_{2n-1}(x)\,w(x)\,dx = \sum \lambda_i\,p_{2n-1}(x_i)$$

for any polynomial $p_{2n-1}(x)$ of degree $2n - 1$, where

$$\lambda_i = \int_a^b w(x)\,l_i(x)\,dx, \qquad l_i(x) = \frac{\pi_n(x)}{\pi_n'(x_i)(x - x_i)}.$$

This result has been established in Chap. 2. We note here, additionally,
that the following converse is true.

Theorem 3.42. If $x_1, \ldots, x_n$ are distinct points in $[a,b]$ such that

$$\int_a^b p_{2n-1}(x)\,w(x)\,dx = \sum \lambda_i\,p_{2n-1}(x_i)$$

for certain numbers $\lambda_i$ and for all polynomials $p_{2n-1}(x)$ of degree $2n - 1$ at most,
then $x_1, \ldots, x_n$ are the zeros of a polynomial of degree $n$, orthogonal to
$1, x, \ldots, x^{n-1}$ over the interval $[a,b]$ with weight function $w(x)$.

We note an alternative representation of the multipliers or Christoffel
numbers $\lambda_i$, using the notation of Sec. 3.5:


Kast = 1 
k,, Waa Xela, X,) 


Theorem 3.43. The Christoffel numbers $\lambda_i$ are always positive.

Proof. The quadrature formula in Theorem 3.41 is necessarily exact
for

$$f(x) = \frac{\pi_n^2(x)}{(x - x_i)^2}$$

since this is a polynomial of degree $2n - 2$. It is clear that

$$f(x_j) = 0, \quad j \ne i; \qquad f(x_i) = [\pi_n'(x_i)]^2.$$

Hence

$$\int_a^b \frac{\pi_n^2(x)}{(x - x_i)^2}\,w(x)\,dx = \lambda_i\,[\pi_n'(x_i)]^2,$$

that is,

$$\lambda_i = \int_a^b \left[\frac{\pi_n(x)}{\pi_n'(x_i)(x - x_i)}\right]^2 w(x)\,dx,$$

so that $\lambda_i > 0$, $i = 1, 2, \ldots, n$.
Theorem 3.44. If $f^{(2n)}(x)$ exists in $[a,b]$, then we have

$$I - Q = \frac{f^{(2n)}(\xi)}{(2n)!} \int_a^b \tilde{\pi}_n^2(x)\,w(x)\,dx$$

for some $\xi$ in $[a,b]$.

Proof. Let $x_1, x_2, \ldots, x_n$ be the zeros of $\pi_n(x)$. Consider the Hermite
interpolant $H(x)$, introduced in Chap. 2, which satisfies

$$H(x_i) = f(x_i), \quad H'(x_i) = f'(x_i), \quad i = 1, 2, \ldots, n.$$

We have noted that

$$f(x) - H(x) = \frac{f^{(2n)}[\xi(x)]}{(2n)!}\,(x - x_1)^2 \cdots (x - x_n)^2.$$

If we multiply across by $w(x)$ and integrate between $(a,b)$, we find

$$\int_a^b f(x)\,w(x)\,dx = \int_a^b H(x)\,w(x)\,dx + \int_a^b \frac{f^{(2n)}[\xi(x)]}{(2n)!}\,(x - x_1)^2 \cdots (x - x_n)^2\,w(x)\,dx.$$

Now, since $H(x)$ is of degree $2n - 1$, we have, exactly,

$$\int_a^b H(x)\,w(x)\,dx = \sum \lambda_i H(x_i) = \sum \lambda_i f(x_i).$$

Hence

$$\int_a^b f(x)\,w(x)\,dx = \sum \lambda_i f(x_i) + R,$$

where

$$R = \int_a^b \frac{f^{(2n)}[\xi(x)]}{(2n)!}\,(x - x_1)^2 \cdots (x - x_n)^2\,w(x)\,dx.$$

But $(x - x_1)^2 \cdots (x - x_n)^2\,w(x)/(2n)!$
is positive in $[a,b]$, and so the mean-value theorem gives us the result we
require.

The integral in Theorem 3.44 can be calculated for any particular set 
of orthogonal polynomials; the results in the classical cases are given 
below. 
Our last topic is that of quadrature schemes (see the discussion of
interpolation schemes of Sec. 3.6). We now consider two triangular
arrays, one of nodes and one of multipliers:

$$\begin{array}{lllclll}
x_1^{(1)} & & & \qquad & \lambda_1^{(1)} & & \\
x_1^{(2)} & x_2^{(2)} & & & \lambda_1^{(2)} & \lambda_2^{(2)} & \\
x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & & \lambda_1^{(3)} & \lambda_2^{(3)} & \lambda_3^{(3)}
\end{array}$$

A special case is that of a Gaussian scheme, where the $x_i^{(n)}$ are the zeros
of the orthogonal polynomials associated with a weight function $w(x)$
and where the $\lambda_i^{(n)}$ are the corresponding Christoffel numbers.

We consider the truth of the relation

$$Q_n(f) = \sum_{i=1}^{n} \lambda_i^{(n)} f(x_i^{(n)}) \to \int_a^b f(x)\,dx. \tag{3.14}$$


The following result is due to Pólya (1933) and Stekloff (1916).

Theorem 3.45. In order for (3.14) to hold for every continuous function
$f(x)$, it is necessary and sufficient that

(i) (3.14) hold for every polynomial $f(x)$.
(ii) $\sum_{i=1}^{n} |\lambda_i^{(n)}|$ be bounded.

The most appealing proof of this is by methods of functional analysis.

We note that (i) is satisfied for any Lagrangian quadrature. We note
also that in the case when the $\lambda_i^{(n)}$ are nonnegative, (i) $\Rightarrow$ (ii); for if we
take $f(x) = 1$, $Q_n(f) \to (b - a)$, so that $Q_n(f)$ is certainly bounded but
$Q_n(f) = \sum \lambda_i^{(n)} = \sum |\lambda_i^{(n)}|$. Combining these two observations, we
deduce from Theorem 3.45 the following result.

Theorem 3.46. A Lagrangian quadrature scheme with nonnegative
multipliers is convergent for any continuous function.

It can be proved that the multipliers are positive in the cases in which
the nodes are the zeros of $T_n(x)$, or of $U_n(x)$. This result is to be
distinguished from the corresponding special case of Theorem 3.43.

Theorem 3.47 [Stieltjes (1884)]. The general Gaussian quadrature
scheme, for any weight function $w(x)$ on an interval $[a,b]$, is convergent for any
continuous function $f(x)$.

Proof. By the Weierstrass theorem, given any $\epsilon > 0$, there is a
polynomial $p(x) = p_N(x)$ of degree $N$, say, such that $|p(x) - f(x)| < \epsilon$,
$a \le x \le b$. Then

$$|I - Q_n(f)| \le \left| \int_a^b f(x)\,w(x)\,dx - \int_a^b p(x)\,w(x)\,dx \right| + \left| \int_a^b p(x)\,w(x)\,dx - Q_n(p) \right| + |Q_n(p) - Q_n(f)|.$$

The first term on the right does not exceed $\epsilon \int_a^b w(x)\,dx$ by our choice of
$p(x)$. The third term similarly does not exceed $\epsilon \sum \lambda_i = \epsilon \int_a^b w(x)\,dx$
[the last equality follows by taking $f(x) = 1$]. All this is true for any $n$.
If we take $n$ so that $2n - 1 > N$, the middle term vanishes, since the
quadrature is exact. Hence we have

$$|I - Q_n(f)| \le 2\epsilon \int_a^b w(x)\,dx, \quad 2n - 1 > N,$$

and so, $\epsilon$ being arbitrary,

$$I = \lim Q_n(f).$$
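As an informal illustration of Theorem 3.47, the following Python sketch applies the Gaussian scheme for the Chebyshev weight $w(x) = (1 - x^2)^{-1/2}$ on $[-1,1]$, with nodes $\cos[(2k - 1)\pi/2n]$ and Christoffel numbers $\pi/n$, to the continuous but nonsmooth function $|x|$. The choice of test function and the name `gauss_chebyshev` are assumptions of this sketch; the exact value $\int_{-1}^{1} |x|(1 - x^2)^{-1/2}\,dx = 2$ is used for comparison.

```python
import math

def gauss_chebyshev(f, n):
    # Q_n(f) with nodes cos((2k-1) pi / 2n) and equal Christoffel numbers pi / n
    return (math.pi / n) * sum(
        f(math.cos((2 * k - 1) * math.pi / (2 * n))) for k in range(1, n + 1)
    )

if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32, 64):
        q = gauss_chebyshev(abs, n)
        print(n, q, abs(q - 2.0))
```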


We conclude with the following negative result, of which we do not 
give the proof. 

Theorem 3.48. A Lagrangian quadrature scheme with equally spaced
nodes is not convergent for every continuous function.


Results for the Classical Cases 


Chebyshev, $T_n$: $a = -1$, $b = 1$, $w(x) = (1 - x^2)^{-1/2}$.

Nodes: $\cos [(2k - 1)\pi/2n]$, $k = 1, 2, \ldots, n$.

Christoffel numbers: $\pi/n$, $k = 1, 2, \ldots, n$.

Coefficient of $f^{(2n)}(\xi)$ in error estimate: $\pi/[2^{2n-1}(2n)!]$.

Chebyshev, $U_n$: $a = -1$, $b = 1$, $w(x) = (1 - x^2)^{1/2}$.

Nodes: $\cos k\pi/(n + 1)$, $k = 1, 2, \ldots, n$.

Christoffel numbers: $\{\pi \sin^2 [k\pi/(n + 1)]\}/(n + 1)$, $k = 1, 2, \ldots, n$.

Coefficient of $f^{(2n)}(\xi)$ in error estimate: $\pi/[(2n)!\,2^{2n+1}]$.

Legendre, $P_n$: $a = -1$, $b = 1$, $w(x) = 1$.

Coefficient of $f^{(2n)}(\xi)$ in error estimate: $2^{2n+1}(n!)^4/\{[(2n)!]^3 (2n + 1)\}$.

Laguerre, $L_n^{(\alpha)}$: $a = 0$, $b = \infty$, $w(x) = x^{\alpha} \exp (-x)$.

Coefficient of $f^{(2n)}(\xi)$ in error estimate: $n!\,\Gamma(n + 1 + \alpha)/(2n)!$.

Hermite, $H_n$: $a = -\infty$, $b = \infty$, $w(x) = \exp (-x^2)$.

Coefficient of $f^{(2n)}(\xi)$ in error estimate: $\sqrt{\pi}\,(n!)/[(2n)!\,2^n]$.
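The two Chebyshev entries can be spot-checked directly, since nodes and Christoffel numbers are explicit. The Python sketch below is a hedged illustration; the function names and the use of the Beta-function values $\int_{-1}^{1} x^{2j}(1 - x^2)^{\mp 1/2}\,dx = \Gamma(j + \tfrac12)\Gamma(1 \mp \tfrac12 + \tfrac12 \cdot 0 )/\ldots$ written out in the code comments are assumptions of this sketch. It measures how well each $n$-point rule reproduces the moments of degree up to $2n - 1$.

```python
import math

def moment_T(m):
    # integral of x^m (1 - x^2)^(-1/2) over [-1, 1] = B(j + 1/2, 1/2), m = 2j
    if m % 2:
        return 0.0
    j = m // 2
    return math.gamma(j + 0.5) * math.gamma(0.5) / math.gamma(j + 1.0)

def moment_U(m):
    # integral of x^m (1 - x^2)^(1/2) over [-1, 1] = B(j + 1/2, 3/2), m = 2j
    if m % 2:
        return 0.0
    j = m // 2
    return math.gamma(j + 0.5) * math.gamma(1.5) / math.gamma(j + 2.0)

def rule_T(n):
    nodes = [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]
    weights = [math.pi / n] * n
    return nodes, weights

def rule_U(n):
    nodes = [math.cos(k * math.pi / (n + 1)) for k in range(1, n + 1)]
    weights = [math.pi * math.sin(k * math.pi / (n + 1)) ** 2 / (n + 1)
               for k in range(1, n + 1)]
    return nodes, weights

def max_moment_error(rule, moment, n):
    # Largest discrepancy over the monomials x^0, ..., x^(2n-1)
    nodes, weights = rule(n)
    return max(
        abs(sum(w * x ** m for x, w in zip(nodes, weights)) - moment(m))
        for m in range(2 * n)
    )

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, max_moment_error(rule_T, moment_T, n),
              max_moment_error(rule_U, moment_U, n))
```

For the monomial $x^{2n}$ the rules are no longer exact, and the discrepancy can be compared with the tabulated error coefficients.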

The numerical values of the nodes and the Christoffel numbers in the 
cases where they are not known explicitly have been tabulated in various 
forms. Among the standard tables are the following: 


Legendre: 
A. N. Lowan, N. Davids, and A. Levenson, Bull. Amer. Math. Soc., 
vol. 48, pp. 739-743, 1942. 
P. Davis and P. Rabinowitz, J. Res. Nat. Bur. Standards, vol. 56, pp. 
35-37, 1956. 
Laguerre: 
H. E. Salzer and R. Zucker, Bull. Amer. Math. Soc., vol. 55, pp.
1004-1012, 1949.
Hermite:
H. E. Salzer, R. Zucker, and R. Capuano, J. Res. Nat. Bur. Standards,
vol. 48, pp. 111-116, 1952.


We note that in the Hermite case it is sometimes convenient to tabulate,
instead of the Christoffel numbers $\lambda_i$ themselves, the numbers
$a_i = \lambda_i \exp x_i^2$, for then we can write the Hermite quadrature formula
as

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} f(x)\,e^{x^2} e^{-x^2}\,dx \approx \sum a_i f(x_i),$$

and similarly for the Laguerre case. This device seems to be due to A.
Reiz and has been adopted in some of the tables cited above.


PROBLEMS 


3.1. If S,(p) => (i) etar-M —np)™, obtain a three-term recurrence 
k=0 


relation for S,,(p) and use it, and Lemmas 3.5-3.6, to evaluate S,(p), 
S,(p) (see Lorentz [12]). 
3.2. Evaluate the de la Vallée Poussin integrals

$$V_n(F,\theta) = \frac{(2n)!!}{(2n - 1)!!}\,\frac{1}{2\pi} \int_{-\pi}^{\pi} F(\phi) \cos^{2n} \tfrac{1}{2}(\phi - \theta)\,d\phi$$

for some simple periodic functions $F(\theta)$, for example, $F(\theta) = |\cos \theta|$.

3.3. Express the de la Vallée Poussin integral

$$V_n(F,\theta) = \frac{(2n)!!}{(2n - 1)!!}\,\frac{1}{2\pi} \int_{-\pi}^{\pi} F(\phi) \cos^{2n} \tfrac{1}{2}(\phi - \theta)\,d\phi$$

explicitly as a trigonometrical polynomial

$$V_n = \tfrac{1}{2} A_0 + \sum_{r=1}^{n} (A_r \cos r\theta + B_r \sin r\theta)$$

and show that the ratios of $A_r$, $B_r$ to the corresponding Fourier coefficients of
$F(\phi)$ are $(n!)^2/(n + r)!(n - r)!$. {Express $\cos^{2n} \tfrac{1}{2}(\phi - \theta)$ first as a sum of
cosines of multiples of $(\phi - \theta)$, then use $\cos [r(\phi - \theta)] = \cos r\phi \cos r\theta +
\sin r\phi \sin r\theta$, and then integrate.}

3.4. Evaluate $B_n(e^x,x)$ and establish the convergence and rate of
convergence directly.

3.5. If $m \le f(x) \le M$, $0 \le x \le 1$, show that $m \le B_n(f,x) \le M$.

3.6. Show that given any function $f(x)$ continuous in $[a,b]$, $0 < a < b < 1$,
it is possible to find a sequence of polynomials $\{p_n(x)\}$ with integral coefficients
such that

$$p_n(x) \to f(x)$$

uniformly in $[a,b]$ (see Lorentz [12]).




3.7. Show that $\int_{-1}^{+1} T_n^2(x)\,dx = 1 - (4n^2 - 1)^{-1}$.

3.8. Find the polynomial $ax^2 + bx + 1$ which has least deviation from
zero in $[-1,1]$, $a$, $b$ being arbitrary.

3.9. Repeat Prob. 3.8 for $ax^2 + x + b$.

3.10. Establish the minimum deviation of $\tilde{T}_n(x)$, using elementary
arguments.

3.11. Find the polynomial of degree $n$ and leading coefficient unity which
vanishes at $x = 0$, $x = 1$ and which deviates least from zero in the interval
$0 \le x \le 1$. {Consider $T_n[(2x - 1) \cos \pi/2n]$.}

3.12. Let $P_k$ be the class of all polynomials $p(x)$ of degree at most $k$ of the
form $p(x) = 1 + x + a_2 x^2 + \cdots + a_k x^k$. For each $p(x)$, there is a greatest
$\tau = \tau(p) > 0$ such that $-\tau \le x \le 0$ implies $|p(x)| \le 1$. Find the
polynomial $E_k(x)$ in $P_k$ for which $\tau = \tau_k$ is maximum. What is the value of $\tau_k$?
Write down $E_k(x)$ for $k = 2, 3, 4$ (see Franklin [40]).

3.13. Find (a) the best constant and (b) the best linear approximation in
the Chebyshev sense to $\arctan x$, in $0 \le x \le 1$. Draw a rough graph of the
error in the linear approximation. [(a) $\pi/8$, (b) $.7854x + .0355$.]

3.14. Find (a) the best constant and (b) the best linear approximation in
the Chebyshev sense and in the mean-square sense to $x^{-1}$ in $1 \le x \le 2$.
Find the corresponding errors. [(a) $\tfrac{3}{4}$, $\log 2$; (b) $-\tfrac{1}{2}x + \tfrac{3}{4} + \tfrac{1}{2}\sqrt{2}$,
$(12 - 18 \log 2)x + (28 \log 2 - 18)$.]

3.15. Repeat Prob. 3.14 for 107 inO <x <1. [(a) 1%, 9M = 3.9086. 
(b) 9x + (1 + 9M — % log 9M), i.e., 9x + 2.5946; (664 — 108M?)x — 
24M + 54M?,i.e., 8.2934x — .2380. Wehave written M = loge = .4343.] 

3.16. We have seen that, among the polynomials $\tilde{p}_n(x) = x^n + \cdots$, that
which is the best approximation to zero in the Chebyshev sense is $\tilde{T}_n(x)$ and
that which is the best approximation to zero in the mean-square sense is
$\tilde{P}_n(x)$. Compare, for general and for large $n$, the efficiencies of $\tilde{T}_n(x)$ and
$\tilde{P}_n(x)$, as approximations to zero, in both norms (see Oberhettinger and
Magnus [35]).

3.17. Obtain the expansions of $|x|$ given in (3.1) and (3.2). Obtain also
the expansion in terms of Chebyshev polynomials of the second kind.

3.18. Expand $\arctan x$ and $\log [(a + x)/(a - x)]$, where $a > 1$, in a
series of Chebyshev polynomials of the first kind in $[-1,1]$ (see Murnaghan
and Wrench [38]).

3.19. Expand $\ln (1 + x)$ in a series of Chebyshev polynomials of the first
kind in $[0,1]$ (see Murnaghan and Wrench [38]).

3.20. Suppose $\{\pi_n(x)\}$ are the polynomials orthogonal with respect to $w(x)$
in the interval $[a,b]$ and write

$$l_i(x) = \pi_n(x)/\pi_n'(x_i)(x - x_i).$$

a. Prove that $l_i$ and $l_j$ are orthogonal with respect to $w(x)$ in $[a,b]$.
b. Prove that

$$\sum \int_a^b w(x)\,[l_i(x)]^2\,dx = \int_a^b w(x)\,dx.$$
3.21. Write down an expression for the error f(x) — L,(fjx) in the
Lagrangian interpolation for a function {(x), for which f “)(x) exists, at the 
points which are the zeros of 7,)(x). Assuming that | f!)(x)| < 1, find an 
estimate for the numerical value of this error at a point x,, —1 <x, <1. 
What are the corresponding numerical values when the nodes are (a) the 
zeros of U, (x), (b) the zeros of P(x), and (c) the points +1, +%, +%, 
+%, +. 

3.22. Show that the quadrature

$$\int_1^3 f(x)\,dx = \tfrac{5}{9} f(2 - \sqrt{3/5}) + \tfrac{8}{9} f(2) + \tfrac{5}{9} f(2 + \sqrt{3/5})$$

is exact when $f$ is a polynomial of degree at most 5.

Indicate how you would obtain a similar result which would be exact when
$f$ is a polynomial of degree at most 7 and the integration is over an arbitrary
interval $[a,b]$.

3.23. $P_n(x)$ being the Legendre polynomial, show that for $-1 \le x \le 1$
we have

$$|P_n(x)| \le 1, \qquad |P_n'(x)| \le \tfrac{1}{2} n(n + 1).$$

3.24. Find the least upper bound of $|U_n(x)|$ for $-1 \le x \le 1$.

3.25. Show that $|T_n(x + iy)| \le |T_n(1 + iy)|$ when $-1 \le x \le 1$,
$-\infty < y < \infty$ (see Duffin and Schaeffer [43]).

3.26. Show that the transcendental equation $I(x) = I(i)$ of Theorem 3.37 is

$$(1 + \tfrac{1}{5}x) \log (1 + \tfrac{1}{5}x) + (1 - \tfrac{1}{5}x) \log (1 - \tfrac{1}{5}x) = \log 1.04 + \tfrac{2}{5} \arctan 5$$

and verify that $x = 3.6334\ldots$ is a root.
4.1 General Introduction


It is desirable at this point to restate the terms of reference for the present 
chapter. Its original aim was to introduce experienced mathematicians 
to the actual use of automatic computers and, at the same time, to give 
them material from which they could build courses for class use. It has 
been the authors’ experience that many mathematicians begin programming
courses, that some even complete the courses and write programs,
but that few ever take them to the machine. The authors have had
considerable experience in these matters and have a corresponding
sympathy for both the instructor and the student. A suggested solution
to the problem just stated is presented below.

In the first place, programming courses should be highly concentrated,
the student should have no other commitments, and the instructor should
be always available. Secondly, plenty of time for actual use of the
machine should be available. The latter may be difficult to justify when
only an upper-class machine is available, but the authors consider it
worthwhile to delay for an hour or so the results of a calculation of
doubtful accuracy on a problem of uncertain value in order to allow one new
mathematician to get familiar with the machine. Such machines as the
SEAC, UNIVAC, and the Datatron 220 have been found quite suitable
for beginners.

In order to maintain the interest of the students, the material used 
should have some novelty and mathematical content; worked examples 
in number theory and Monte Carlo, for example, are preferred to those 
of the form, ‘Fill every cell with a stop command.” 
It is recommended that the initial words be written in the machine
language and not in any pseudo codes. Only when the student has 
himself realized the possibility of pseudo codes, should he be allowed to 
use them. 

The frightening complication of the actual subroutines, compilers, etc., 
at any particular working organization makes them unsuitable for a 
beginner; the worked examples should begin with programs which do 
not involve subroutines, and then a few subroutines and some elementary 
processing routines for them should be written by the student before he 
is allowed to graduate into current practice. 

The authors are fully aware of the enormous duplication of effort in 
the programming field and appreciate the efforts directed toward uni- 
formization, but these efforts have not yet produced anything definitive 
enough to present to beginners. Perhaps a more systematic presentation 
will soon be available to replace the following naive approach. 

For a general introduction to the subject of automatic computers, 
see Alt [13], and for an informal account of the logic of computer 
usage see Chap. 5. 


BASIC TRAINING ON A HYPOTHETICAL COMPUTER 
4.2 Introduction 


In this chapter we construct a hypothetical automatic computer and 
learn how to program for it. Although no such computer exists, the 
ability to write programs for it will be transferable immediately to any 
existing automatic computer. The differences are only in machine 
characteristics, not in what is referred to as programming logic. 

We begin with an entirely abstract situation, which we later make
concrete. Imagine an unlimited number of compartments, or cells,
numbered 0, 1, 2, . . . in each of which a real number, or an instruction, can be
stored (“instruction” is defined below). The numbers 0, 1, . . . are the
addresses of the cells. We make the convention that, if α is the address of
a cell, then (α) denotes the contents of the cell. Similarly, if x is a real
number which is stored in a cell, then [x] denotes the address of such a
cell. Notice that (α) is uniquely defined but that [x] is not, since more
than one cell may contain the real number x. Always, however, ([x]) = x.
We use the symbol → ambiguously to mean “goes into,” “replaces,”
or “go to.” An instruction is any of the following expressions:

(α) + (β) → γ
(α) − (β) → γ
(α) × (β) → γ
(α) ÷ (β) → γ
(α) < (β) → γ

All except the final instruction are arithmetic instructions, the first,
for example, being understood to mean that the contents of cells α and
β are added together, and the result stored in compartment γ. Similar
interpretations are given to the next three instructions. It is further
assumed that once any of these instructions has been performed, the
next instruction performed is the one following immediately in
sequence.

The last instruction, however, is of a different nature and is a logical
instruction. Here the contents of cells α and β are compared, and if (α)
< (β), the next instruction performed is the one in cell γ, whereas if
(α) ≥ (β), then the next instruction performed is the one that follows
immediately in sequence.

We pay no attention at this point to stopping, starting, input, or out- 
put. These matters are considered later. 

We are now in a position to write simple programs. These must perforce
be limited severely, since we have made no provisions for modifying
instructions, but the basic idea of a loop can be made plain.

Instructions are written with the address to the left and an explanation 
on the right. 

Example 1. Compute $x^p$, where x is a real number, p a positive integer, and

p = (1), x = (2), 1 = (3).

10   (4) − (4) → 4     Set (4) = 0 (r = 0).
11   (3) × (3) → 5     Set (5) = 1 ($x^r$ = 1).
12   (3) + (4) → 4     1 + r replaces r.
13   (2) × (5) → 5     $x \cdot x^r = x^{r+1}$ replaces $x^r$.
14   (4) < (1) → 12    If r < p, go to 12; otherwise go to 15.


Instructions 10 and 11 are “‘setting”’ instructions, which are necessary be- 
cause we assume nothing about the contents of a cell. Instruction 12 is a 
“tally” instruction, which records the power r to which x has been raised. 
Instruction 13 is the “work” instruction, and instruction 14 is the ‘“‘compari- 
son’’ instruction which produces the loop. Notice that instruction 15, which 
we have not specified, would presumably be a command to print (5) or to 
halt. 
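Example 1 can be followed mechanically with a small interpreter. The Python sketch below is a minimal model of the abstract machine, not of any real computer: the names `run` and `power`, the tuple encoding of instructions, and the convention that execution stops when control reaches an address holding no instruction are all assumptions of this sketch. Instruction 11 is read here as a product, (3) × (3) → 5, which is what the comment "Set (5) = 1" requires.

```python
def run(program, cells, start, max_steps=10_000):
    """Execute instructions until control reaches an address with no
    instruction (a stand-in for the unspecified halt or print command)."""
    pc = start
    for _ in range(max_steps):
        if pc not in program:
            return pc
        op, a, b, c = program[pc]
        if op == "+":
            cells[c] = cells[a] + cells[b]
        elif op == "-":
            cells[c] = cells[a] - cells[b]
        elif op == "*":
            cells[c] = cells[a] * cells[b]
        elif op == "/":
            cells[c] = cells[a] / cells[b]
        elif op == "<":
            if cells[a] < cells[b]:
                pc = c          # comparison succeeded: transfer control
                continue
        else:
            raise ValueError(op)
        pc += 1                 # otherwise take the next instruction in sequence
    raise RuntimeError("step limit exceeded")

def power(x, p):
    # Cells 1..3 hold p, x, 1; cells 4 and 5 start with arbitrary contents.
    cells = {1: p, 2: x, 3: 1, 4: 123.0, 5: -7.0}
    program = {
        10: ("-", 4, 4, 4),    # r = 0
        11: ("*", 3, 3, 5),    # x^r = 1
        12: ("+", 3, 4, 4),    # r = r + 1
        13: ("*", 2, 5, 5),    # x^r = x * x^r
        14: ("<", 4, 1, 12),   # loop while r < p
    }
    run(program, cells, start=10)
    return cells[5]

if __name__ == "__main__":
    print(power(2.0, 10))
```

The arbitrary initial contents of cells 4 and 5 show why the setting instructions 10 and 11 are needed.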

Example 2. Let $\epsilon > 0$, $1 \ge x > 0$ be given. Compute $e^x$ with an error not
exceeding $\epsilon$.

We set $T_0 = 1$, $T_n = (x/n)\,T_{n-1}$, $n \ge 1$; $S_0 = 1$, $S_n = S_{n-1} + T_n$, $n \ge 1$. Then
for every $n \ge 1$, $e^x = S_{n-1} + t_n$, where $0 < t_n < T_n e$. Thus, if $T_n < \epsilon/e$,
then $0 < e^x - S_{n-1} < \epsilon$.

The program follows, with storage assignments to the left:

1   n            10   (7) + (8) → 1     Set n = 1.
2   T_{n−1}      11   (7) + (8) → 2     Set T_{n−1} = 1 (T_0 = 1).
3   S_{n−1}      12   (7) + (8) → 3     Set S_{n−1} = 1 (S_0 = 1).
4   Temporary    13   (5) ÷ (1) → 4     x/n.
    storage
5   x            14   (4) × (2) → 2     (x/n) T_{n−1} = T_n replaces T_{n−1}.
6   ε/e          15   (2) < (6) → 19    If T_n < ε/e go to 19; otherwise
                                        go to 16.
7   1            16   (3) + (2) → 3     S_{n−1} + T_n = S_n replaces S_{n−1}.
8   0            17   (1) + (7) → 1     n + 1 replaces n.
                 18   (8) < (7) → 13    Go to 13.


Instructions 10 to 12 are setting instructions. Instructions 13 and 14
compute $T_n$ from $T_{n-1}$. Instruction 15 decides whether or not the computation
has gone far enough. Instruction 16 computes $S_n$ from $S_{n-1}$ and $T_n$.
Instruction 17 “resets” n, and instruction 18 is a transfer instruction. Although
instruction 19 has not been specified, it would presumably be a print command
or a halt command, as in Example 1.
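The loop of Example 2 translates almost line for line into a modern language. The following Python sketch is such a transliteration under stated assumptions: the function name `exp_series` is ours, ordinary floating-point arithmetic replaces the abstract machine's real numbers, and the value returned at exit is the partial sum $S_{n-1}$ held in cell 3.

```python
import math

def exp_series(x, eps):
    """Sum the series for e**x, 0 < x <= 1, stopping as soon as T_n < eps/e."""
    threshold = eps / math.e     # cell 6
    n = 1                        # cell 1
    term = 1.0                   # cell 2: T_{n-1}
    total = 1.0                  # cell 3: S_{n-1}
    while True:
        term = (x / n) * term    # instructions 13 and 14: T_n
        if term < threshold:     # instruction 15: far enough?
            return total         # S_{n-1}
        total += term            # instruction 16: S_n
        n += 1                   # instruction 17

if __name__ == "__main__":
    print(exp_series(1.0, 1e-8), math.exp(1.0))
```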

Examples 1 and 2 illustrate different methods of controlling the length
of a loop. In Example 1, we counted until a specified number of
multiplications had been performed. In Example 2, we computed until a
number produced by the computation was made smaller than a specified
quantity.

Example 3. Compute (a,b), the greatest common divisor of the nonnegative integers
a, b.

We use the properties (a,b) = (b,a) = (a,b − a) and (0,b) = b. We assume
a = (1), b = (2), 0 = (3), 1 = (4).

10   (2) < (1) → 14    If b < a, go to 14; otherwise go to 11.
11   (1) < (4) → 18    If a < 1, go to 18; otherwise go to 12.
12   (2) − (1) → 2     b − a replaces b.
13   (3) < (4) → 10    Go to 10.
14   (1) + (3) → 5     Instructions 14 to 16 interchange a and b.
15   (2) + (3) → 1
16   (5) + (3) → 2
17   (3) < (4) → 11    Go to 11.


The instruction 13 is an “‘unconditional transfer’’ built out of the comparison 
instruction. Actual machines will usually have such an instruction in their 
repertoire, and it will be faster than this artificial one. 

Once again the unspecified instruction in cell 18 would be a halt command
or a print command.

It must be understood that this is a most impractical algorithm to compute 
(a,b). 

It is instructive to follow the course of a computation in detail. Suppose
that, in Example 3, a = 9, b = 30. The instructions would be
executed as follows:

Instruction   (1)   (2)   (5)
    10         9    30
    11         9    30
    12         9    21
    13         9    21
    10         9    21
    11         9    21
    12         9    12
    13         9    12
    10         9    12
    11         9    12
    12         9     3
    13         9     3
    10         9     3
    14         9     3     9
    15         3     3     9
    16         3     9     9
    17         3     9     9
    11         3     9     9
    12         3     6     9
    13         3     6     9
    10         3     6     9
    11         3     6     9
    12         3     3     9
    13         3     3     9
    10         3     3     9
    11         3     3     9
    12         3     0     9
    13         3     0     9
    10         3     0     9
    14         3     0     3
    15         0     0     3
    16         0     3     3
    17         0     3     3
    11         0     3     3
    18


On the right, in each column, we have indicated the contents of the 
active cells. It is interesting to observe, in this simple example, how 
little computing is being done. 
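The course of the computation can be reproduced by stepping through Example 3 with an explicit instruction counter. The Python sketch below is one way to do this; the function name `gcd_trace` is ours, and cell 18 is treated as a halt, which the text leaves unspecified. It assumes the storage assignment a = (1), b = (2), 0 = (3), 1 = (4) implied by the program.

```python
def gcd_trace(a, b):
    """Follow Example 3 instruction by instruction.
    Returns the list of addresses visited and the final contents of cell 2."""
    cells = {1: a, 2: b, 3: 0, 4: 1, 5: None}
    executed = []
    pc = 10
    while pc != 18:
        executed.append(pc)
        if pc == 10:                       # if b < a go to 14
            pc = 14 if cells[2] < cells[1] else 11
        elif pc == 11:                     # if a < 1 go to 18
            pc = 18 if cells[1] < cells[4] else 12
        elif pc == 12:                     # b - a replaces b
            cells[2] = cells[2] - cells[1]
            pc = 13
        elif pc == 13:                     # unconditional transfer to 10
            pc = 10
        elif pc == 14:                     # 14 to 16 interchange a and b
            cells[5] = cells[1] + cells[3]
            pc = 15
        elif pc == 15:
            cells[1] = cells[2] + cells[3]
            pc = 16
        elif pc == 16:
            cells[2] = cells[5] + cells[3]
            pc = 17
        elif pc == 17:                     # unconditional transfer to 11
            pc = 11
    executed.append(18)
    return executed, cells[2]

if __name__ == "__main__":
    trace, g = gcd_trace(9, 30)
    print(g, len(trace), trace)
```

For a = 9, b = 30 only six of the instructions executed are subtractions; the rest are comparisons, transfers, and the two interchanges.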


4.3 Specification of HAC 


We go on now to the construction of our hypothetical automatic
computer (HAC). HAC’s “memory” consists of 100 cells numbered
from 00 to 99. Each cell is capable of retaining a “word,” which may
be interpreted as a number or as an instruction. Each word is made up
of seven decimal digits followed by a sign, and if the word is to be
understood as a number, then the decimal point is at the left. Thus

1000000 + = $10^{-1}$
5000000 − = $-2^{-1}$
0000000 + = 0
0000000 − = −0
0000001 + = $10^{-7}$
9999999 − = $-1 + 10^{-7}$,

etc. If the word is to be understood as an instruction, then the first two
digits are the α digits, the next two the β digits, the next two the γ digits,
and the final one is the δ, or operation, digit, as follows:

00 00 00 0 +
α  β  γ  δ sign.
The actual instructions follow:

Input-output instructions

α β γ 0 +    Read into α, α + 1, . . . , β (γ irrelevant).
α β γ 1 +    Print from α, α + 1, . . . , β (γ irrelevant).

Arithmetic instructions

α β γ 2 +    (α) + (β) → γ.
α β γ 3 +    (α) − (β) → γ.
α β γ 4 +    (α) × (β) → γ (low-order product).
α β γ 5 +    (α) × (β) → γ (unrounded high-order product).
α β γ 6 +    (α) ÷ (β) → γ (unrounded quotient).

Logical instructions

α β γ 7 +    (α) < (β) → γ.
α β γ 8 +    |(α)| < |(β)| → γ.


If x = (α), y = (β), then operations 2 and 3 actually produce {x + y},
{x − y}, respectively ({x} stands for the fractional part of x). Thus, for
example, if (01) = 5000000 +, (02) = 8000000 −, then the instructions

01 02 03 2+
01 01 04 3+

produce 3000000 −, 0000000 + in cells 03, 04, respectively, and the
instructions

01 01 03 2+
01 02 04 3+

produce 0000000 +, 3000000 + in cells 03, 04, respectively. This loss
of digits on the left is termed overflow. No such loss is possible with
operations 4 and 5. Thus, for example, if (01) = 5000000 +,
(02) = 9999999 +, then the instructions

01 01 03 5+
02 02 04 4+
02 02 05 5+

produce 2500000 +, 0000001 +, 9999998 +, respectively, in cells 03, 04,
and 05. The low-order product is the one to use for computing with
whole numbers, which are thought of as stored with a scale factor
$10^{-7} = \epsilon$.

In operation 6, HAC halts if |(α)| ≥ |(β)| and does not perform the
instruction. If (01) = 0100000 +, (02) = 0600000 −, then the
instruction

01 02 03 6+

produces 1666666 − in cell 03.
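The arithmetic of HAC words can be modeled with ordinary integers. In the Python sketch below a word is represented by a signed integer equal to the seven digits, so that the number it stands for is that integer divided by $10^7$; this representation, the function names, and the reading of the halting condition in operation 6 as a comparison of magnitudes are assumptions of the sketch (in particular it does not distinguish −0 from 0). It is meant only to reproduce the numerical examples given in the text.

```python
SCALE = 10 ** 7   # seven decimal digits to the right of the point

def sign(v):
    return -1 if v < 0 else 1

def hac_add(x, y):
    # operation 2: fractional part {x + y}; digits lost on the left are overflow
    s = x + y
    return sign(s) * (abs(s) % SCALE)

def hac_sub(x, y):
    # operation 3: fractional part {x - y}
    return hac_add(x, -y)

def hac_mul_low(x, y):
    # operation 4: last seven digits of the fourteen-digit product
    return sign(x * y) * ((abs(x) * abs(y)) % SCALE)

def hac_mul_high(x, y):
    # operation 5: first seven digits of the product, unrounded
    return sign(x * y) * ((abs(x) * abs(y)) // SCALE)

def hac_div(x, y):
    # operation 6: unrounded quotient; refuse when the result is not a HAC number
    if abs(x) >= abs(y):
        raise ArithmeticError("HAC halts: quotient is not a HAC number")
    return sign(x) * sign(y) * ((abs(x) * SCALE) // abs(y))

def word(v):
    # seven digits followed by the sign, as written in the text
    return f"{abs(v):07d}" + ("-" if v < 0 else "+")

if __name__ == "__main__":
    a, b, c = 5000000, -8000000, 9999999
    print(word(hac_add(a, b)), word(hac_sub(a, b)), word(hac_mul_high(c, c)))
```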

The logical operations 7 and 8 are straightforward and need no
comment.

The input operation 0 and the output operation 1 have no counterparts
in our earlier discussion. Here we provide for moving data into
and out from HAC’s memory. Notice that the γ digits are irrelevant.
If β ≤ α, then HAC will read or print one word into or from cell α. If β
> α, then HAC will read or print into or from cells α, α + 1, . . . , β in
this order.

We make the convention that an instruction with a minus sign halts
HAC, after the instruction is performed.

We assume that HAC has a “start switch.” When this switch is on,
HAC reads a word into cell 00 and goes there, so that (00) will be
interpreted as an instruction and performed. Thus, suppose that we wish to
read a program into HAC’s memory that begins in 10 and ends in 40,
the first instruction performed by the program being the one in 20. The
actual program should be preceded by the instructions

01 02 00 0+
10 40 00 0+
00 01 20 7−.


We return to this point later. 

HAC treats 0 and —0 as identical numbers. 

In planning our programs, we follow the convention that 00 will be 
used only as a “‘starter,” 01 to 09 will be used for temporary storage, and 
10 to 19 will be used for “standard constants,” which we list below: 


10   0000000 + = 0
11   0000001 + = $10^{-7} = \epsilon$
12   0000010 + = $10^{-6}$
13   0000100 + = $10^{-5}$
14   0001000 + = $10^{-4}$
15   0010000 + = $10^{-3}$
16   0100000 + = $10^{-2}$
17   1000000 + = $10^{-1}$
18   5000000 + = $2^{-1}$
19   9999999 + = $1 - 10^{-7} = 1 - \epsilon$.
4.4 Worked Example: Cyclotomic Numbers

We are going to construct a program for HAC which will compute and
print the n cyclotomic numbers, for a series of values of n,

$$\zeta^r, \quad 1 \le r \le n, \quad \zeta = \exp (2\pi i/n).$$


The simplest scheme to follow would be to employ the recurrence
formula

$$\zeta^{r+1} = \zeta \cdot \zeta^r$$

so that the only significant computation involved would be that of $\zeta$.
This, however, has the disadvantage that the round-off error incurred
by the multiplications tends to accumulate and could be serious enough
to ruin $\zeta^r$ for values of r near n. Instead, we adopt a procedure which
makes the computation of the rth number independent of the preceding
ones. We must also remember that HAC rejects numbers $\ge 1$ in
absolute value. The steps (in broad outline) follow.

Step 1. Read in word. If the word is a legitimate number n, go to
step 2; otherwise halt.

Step 2. Print n.

Step 3. Compute $\theta = \pi/(4n)$.

Step 4. Set $r = 1$, $\theta_r = \theta$, where $\theta_r = r\theta$.

Step 5. Compute $\cos \theta_r$, $\sin \theta_r$.

Step 6. Compute $\cos 8\theta_r$, $\sin 8\theta_r$ by repeated use of the identities
$\cos 2x = \cos^2 x - \sin^2 x$, $\sin 2x = 2 \sin x \cos x$.

Step 7. Print $r$, $\cos 8\theta_r$, $\sin 8\theta_r$.

Step 8. If $r < n$, go to step 9; otherwise go to step 1.

Step 9. Replace $r$ by $r + 1$, $\theta_r$ by $\theta_{r+1}$, and go to step 5.

An examination of this schedule reveals that step 5 is the only one
involving serious computation, and accordingly we concentrate our
attention here. We need routines for cos x, sin x where x is a HAC number.
Put

$$C_0 = 1, \qquad C_k = -\frac{x^2}{2k(2k-1)}\,C_{k-1}, \qquad k \ge 1,$$

$$S_0 = x, \qquad S_k = -\frac{x^2}{2k(2k+1)}\,S_{k-1}, \qquad k \ge 1,$$

$$A_k = C_0 + C_1 + \cdots + C_k, \qquad k \ge 0,$$

$$B_k = S_0 + S_1 + \cdots + S_k, \qquad k \ge 0.$$

Then

$$\cos x = A_{k-1} + c_k, \qquad k \ge 1,$$

$$\sin x = B_{k-1} + s_k, \qquad k \ge 1,$$

where $|c_k| \le |C_k|$, $|s_k| \le |S_k|$. Also,

$$S_k = \frac{x}{2k+1}\,C_k,$$

and so $|S_k| < |C_k|$. To compute cos x, sin x to a given precision $E$, it is
necessary only that $|C_k| < E$.

We write this as a subroutine, anticipating the possibility that we may 
want such a routine elsewhere. 

We now write out step 5 in somewhat more detail. Instead of $\theta_r$, we
write x.

Step 5.1. Set $k = 1$, $C_{k-1} = C_0 = 1$, $S_{k-1} = S_0 = x$, $A_{k-1} = A_0 = 1$,
$B_{k-1} = B_0 = x$.

Step 5.2. $C_k = -\{x^2/[2k(2k - 1)]\}C_{k-1}$ replaces $C_{k-1}$.

Step 5.3. If $|C_k| < E$, go to exit; otherwise go to step 5.4.

Step 5.4. $S_k = -\{x^2/[2k(2k + 1)]\}S_{k-1}$ replaces $S_{k-1}$.

Step 5.5. $A_{k-1} + C_k = A_k$ replaces $A_{k-1}$.

Step 5.6. $B_{k-1} + S_k = B_k$ replaces $B_{k-1}$.

Step 5.7. $k + 1$ replaces $k$, and go to step 5.2.
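The loop of steps 5.1 to 5.7 can be sketched in a modern language. The following Python function is an illustration only, not the HAC code: the names `cos_sin_series` and `tol` are assumptions of this sketch, floating-point arithmetic stands in for HAC's fixed-point words, and the default tolerance mirrors the choice E = 5e of the text with e = 10^{-7}.

```python
def cos_sin_series(x, tol=5e-7):
    """Return (cos x, sin x) for |x| <= 1 using the C_k, S_k recurrences."""
    c_term, s_term = 1.0, x          # step 5.1: C_0 and S_0
    a_sum, b_sum = 1.0, x            # step 5.1: A_0 and B_0
    minus_x2 = -x * x
    k = 1
    while True:
        c_term *= minus_x2 / (2 * k * (2 * k - 1))   # step 5.2: C_k
        if abs(c_term) < tol:                        # step 5.3: exit test
            return a_sum, b_sum                      # A_{k-1}, B_{k-1}
        s_term *= minus_x2 / (2 * k * (2 * k + 1))   # step 5.4: S_k
        a_sum += c_term                              # step 5.5: A_k
        b_sum += s_term                              # step 5.6: B_k
        k += 1                                       # step 5.7
```

On exit the omitted cosine term is smaller than `tol`, and the omitted sine term is smaller still, which is the stopping argument given above.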

For the detailed code for step 5, we assume that x has been put into 09
and that the starting point is 20.


    20      41 10 05 2+     Set $2ke = 2e$ in 05.
    21      19 10 01 2+     Set $C_{k-1} = C_0 = 1 - e$ in 01.
    22      09 10 02 2+     Set $S_{k-1} = S_0 = x$ in 02.
    23      19 10 03 2+     Set $A_{k-1} = A_0 = 1 - e$ in 03.
    24      09 10 04 2+     Set $B_{k-1} = B_0 = x$ in 04.

    25, 26  (instructions illegible in the source)
                            $-x^2$ replaces $x$ in 09.

    27      05 05 06 4+     $(2k)^2 e \to 06$.
    28      06 05 07 3+     $2k(2k - 1)e \to 07$.
    29      06 05 08 2+     $2k(2k + 1)e \to 08$.

    30      11 07 07 6+
    31      07 09 07 5+     $C_k = -\{x^2/[2k(2k - 1)]\}C_{k-1}$ replaces $C_{k-1}$.
    32      07 01 01 5+

    33      01 42 00 8+     If $|C_k| < E$, go to exit; otherwise go on.

    34      11 08 08 6+
    35      08 09 08 5+     $S_k = -\{x^2/[2k(2k + 1)]\}S_{k-1}$ replaces $S_{k-1}$.
    36      08 02 02 5+

    37      03 01 03 2+     $A_{k-1} + C_k = A_k$ replaces $A_{k-1}$.
    38      04 02 04 2+     $B_{k-1} + S_k = B_k$ replaces $B_{k-1}$.
    39      05 41 05 2+     $k + 1$ replaces $k$.

    40      10 11 27 7+     Go to 27.

    41      00 00 00 2+     $2e$.

    42      00 00 00 5+     $E = 5 \times 10^{-7}$.
One or two points should be noted. The original value of x has been
destroyed, 09 containing $-x^2$ after the subroutine has run its course.
Instruction 33 is an “exit” instruction and must be set up before the
subroutine is entered. (This will be clumsy, since HAC lacks an “extract”
order.) The nonstandard constants $2e$ and $E$ (which was taken to be
$5e$) are part of the subroutine. The cell 03 contains cos x and 04
contains sin x at the end of the subroutine.

It is generally better to write subroutines as if they originated in 00
and then to “incorporate” them, but we disregard this refinement here.
We return to this point in Sec. 4.21.

The next part of the program which should be examined is step 6,
since the identities are applied three times. There is a simple loop
involved here, and step 6 in more detail follows (once again we write x
instead of $\theta_r$).

Step 6.1. Set $k = 1$.

Step 6.2. $\cos 2x = \cos^2 x - \sin^2 x$, $\sin 2x = 2 \sin x \cos x$; replace
cos x, sin x, respectively.

Step 6.3. If $k < 3$, go to step 6.4; otherwise go to exit.

Step 6.4. Replace $k$ by $k + 1$, and go to step 6.2.
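The doubling loop of step 6, together with the outer loop of steps 3 to 9, can be sketched as follows. This is a Python illustration under stated assumptions: the function names are invented for the example, and `math.cos` and `math.sin` stand in for the series subroutine of step 5. Note that the angle passed to step 5 never exceeds $\pi/4$, so it stays inside HAC's range of numbers below 1 in absolute value.

```python
import math

def double_angle(c, s, times=3):
    """Apply cos 2x = cos^2 x - sin^2 x, sin 2x = 2 sin x cos x repeatedly."""
    for _ in range(times):                 # steps 6.1 to 6.4
        c, s = c * c - s * s, 2.0 * s * c
    return c, s

def cyclotomic_numbers(n):
    """Return rows (r, cos(2*pi*r/n), sin(2*pi*r/n)) for r = 1, ..., n."""
    theta = math.pi / (4 * n)              # step 3
    rows = []
    for r in range(1, n + 1):              # steps 4, 8 and 9
        theta_r = r * theta
        c, s = math.cos(theta_r), math.sin(theta_r)   # step 5 (library stand-in)
        c8, s8 = double_angle(c, s, 3)     # step 6: angles 2x, 4x, 8x
        rows.append((r, c8, s8))           # step 7 ("print")
    return rows
```

Each row is computed from r alone, so an error in one row does not propagate to the next, which is the reason the text gives for avoiding the recurrence.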

Remembering that cos x, sin x are in 03, 04, respectively, the detailed 
code would be as follows: 


    43      11 10 01 2+     Set $ke = 1 \cdot e$ in 01.

    44-48   (instructions illegible in the source)
                            $\cos 2x = \cos^2 x - \sin^2 x \to 09$.
                            $\sin 2x = 2 \sin x \cos x$ replaces $\sin x$.

    49      09 10 03 2+     $\cos 2x$ replaces $\cos x$.

    50      41 01 00 7+     If $2 < k$, go to exit; otherwise go on.
    51      01 11 01 2+     $k + 1$ replaces $k$.

    52      10 11 44 7+     Go to 44.


We have now written out all the difficult bits, and we can prepare the 
final HAC program with little difficulty. The question of inputting the 
program should be discussed somewhat more fully, however. A real- 
istic approach to this question involves the knowledge that any com- 
puter can make an error, and provision must be made for checking any 
data read into the computer. A simple way to do this with a program 
which will be read in more than once is to compute and print a 
“memory sum.” Thus a program which occupies cells A through C 
and has its first instruction in B should be preceded by the following 
instructions:

    00      01 08 00 0+     Read in 01 to 08.
    01      A  C  00 0+     Read in program.

    02-05   (instructions illegible in the source)
                            Form memory sum in 00.

    06      00 00 00 1+     Print memory sum.
    07      03 04 B  7+     Go to B.
    08      01 00 00 0+     1 in $\alpha$ position.


Notice that nothing is set up, since the instructions will be performed
just once and will be destroyed when the program begins. The fact
that the summing cell 00 is not initially clear (empty) is irrelevant, since
we are interested only in the constancy of the memory sum, not in any
given value; and 00 contains the same number initially each time the
program is read into HAC.

The actual HAC program follows. Since we are comparing to zero 
to determine when no more sets of cyclotomic numbers are to be com- 
puted, the input for the computation of the numbers of orders 5, 7, 12 
might be 

0000005 + 
0000007 + 
0000012 + 
1234567 —. 


The last word is completely arbitrary, so long as it is negative. Notice
that the halt order 76 prints out this word at the end of the run.

A good problem is to examine the program on page 171 for possible
errors by “clocking it through” with a numerical example.

Note that the “empty” cells 53, 54, 55, and 56 will contain $n$, $r$, $\theta$, and
$\theta_r$, respectively.


4.5 Worked Example: Minors of Triple-diagonal Matrix 


We go on now to another example. Let $M$ be the triple-diagonal
matrix

$$M = \begin{pmatrix}
b_1 & c_1 & & & \\
a_1 & b_2 & c_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-2} & b_{n-1} & c_{n-1} \\
 & & & a_{n-1} & b_n
\end{pmatrix}$$

with whole-number entries. We are going to compute and print the
running minors of $M$, which are given by the recurrence formula

$$D_0 = 1, \qquad D_1 = b_1, \qquad D_k = b_k D_{k-1} - a_{k-1} c_{k-1} D_{k-2}, \qquad 2 \le k \le n.$$


The steps (in broad outline) follow:

Step 1. Set $D_0 = 1$, $D_1 = b_1$, $k = 2$.

Step 2. If $n < k$, go to step 5; otherwise go to step 3.

Step 3. Compute and store $D_k = b_k D_{k-1} - a_{k-1} c_{k-1} D_{k-2}$.

Step 4. Replace $k$ by $k + 1$ and go to step 2.

Step 5. Print $D_0, D_1, \ldots, D_n$.
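The five steps amount to a single loop over k. A minimal Python sketch follows; the function name and the use of zero-based lists (`b[0]` holding $b_1$, `a[0]` holding $a_1$, `c[0]` holding $c_1$) are assumptions of the example, not part of the HAC program.

```python
def running_minors(a, b, c):
    """Return [D_0, D_1, ..., D_n] for the triple-diagonal matrix with
    diagonal b (length n), subdiagonal a and superdiagonal c (length n - 1)."""
    n = len(b)
    d = [1, b[0]]                          # step 1: D_0 = 1, D_1 = b_1
    for k in range(2, n + 1):              # steps 2 and 4
        # step 3: D_k = b_k D_{k-1} - a_{k-1} c_{k-1} D_{k-2}
        d.append(b[k - 1] * d[k - 1] - a[k - 2] * c[k - 2] * d[k - 2])
    return d                               # step 5 ("print")
```

For example, `running_minors([1, 1], [2, 2, 2], [1, 1])` gives the minors of the 3 by 3 matrix with 2 on the diagonal and 1 beside it.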

The significant computation occurs in step 3, and when we write this
out in detail, it has something of the following appearance:

    [$a_{k-1}$]   [$c_{k-1}$]   01        4+     $a_{k-1} c_{k-1} \to 01$.
    [$D_{k-2}$]   01          01        4+     $a_{k-1} c_{k-1} D_{k-2} \to 01$.
    [$b_k$]     [$D_{k-1}$]   02        4+     $b_k D_{k-1} \to 02$.
    02          01          [$D_k$]   3+     $D_k \to$ storage.


We see that a new feature emerges. Just as in forming a memory sum,
the orders performing the computation are subject to change and will
have to be set up and modified during the course of the computation.
We have introduced here a common convention: enclosing instructions
(or parts of them) which are subject to modification, in brackets [ ].
Since the computation starts with $k = 2$, the initial form of the orders is

    [$a_1$]   [$c_1$]   01        4+
    [$D_0$]   01      01        4+
    [$b_2$]   [$D_1$]   02        4+
    02      01      [$D_2$]   3+.

Assuming that

    $a_k$ is in $a + k$,   $1 \le k \le n - 1$,
    $b_k$ is in $b + k$,   $1 \le k \le n$,
    $c_k$ is in $c + k$,   $1 \le k \le n - 1$,
    $D_k$ is in $D + k$,   $0 \le k \le n$,

the orders become

    a+k-1   c+k-1   01      4+
    D+k-2   01      01      4+
    b+k     D+k-1   02      4+
    02      01      D+k     3+,

and this initial form becomes

    a+1     c+1     01      4+
    D       01      01      4+
    b+2     D+1     02      4+
    02      01      D+2     3+.

We can now write out the program. We assume that n is in cell N.

    20      11      10      D       2+     Set $D_0 = 1$.
    21      b+1     10      D+1     2+     Set $D_1 = b_1$.
    22      11      11      09      2+     Set $k = 2$ in 09.
    23      41      10      28      2+
    24, 25  (illegible in the source)      Set up 28, 29, 30, 31.
    26      44      10      31      2+
    27      N       09      38      7+     If $n < k$, go to 38; otherwise go on.
    28      a+k-1   c+k-1   01      4+
    29      D+k-2   01      01      4+     Compute and store $D_k$.
    30      b+k     D+k-1   02      4+
    31      02      01      D+k     3+
    32      09      11      09      2+     $k + 1$ replaces $k$.
    33      45      28      28      2+
    34, 35  (illegible in the source)      Reset 28, 29, 30, 31.
    36      12      31      31      2+
    37      10      11      27      7+     Go to 27.
    38      N       15      08      6+
    39      46      08      40      3+     Set up 40.
    40      D       D+n     00      1-     Print $D_0, D_1, \ldots, D_n$, and halt.
    41      a+1     c+1     01      4+
    42      D       01      01      4+
    43      b+2     D+1     02      4+     Constants.
    44      02      01      D+2     3+
    45      01      01      00      0+
    46      D       D       00      1-


We do not bother to assign actual numerical values to a, b, c, D, N or
to write a complete program. It should be noted that the actual
contents of cells 28, 29, 30, 31, and 40 are irrelevant.

In 38 we use a division to “shift” a number. More sophisticated
machines do this with a special operation. We could have accomplished
the same thing by low-order multiplying by $1000e$.


4.6 Problems: Generation of Partitions and Permutations


The reader is invited to prepare the detailed HAC program for the
generation of all the partitions of a positive integer n into precisely k
parts. We denote the parts occurring in such a partition by $x_1, x_2, \ldots,
x_k$ and assume that $x_1 \ge x_2 \ge \cdots \ge x_k \ge 1$. Then $n = x_1 + x_2 + \cdots
+ x_k$, and so $k x_k \le n$. The outline for the program follows:

Step 1. Set $r = 2$.

Step 2. Set $x_1 = x_2 = \cdots = x_k = 1$.

Step 3. Compute $S = x_1 + x_2 + \cdots + x_k$.

Step 4. If $n < S$, go to step 9; otherwise go to step 5.

Step 5. Replace $x_1$ by $x_1 + n - S$.

Step 6. Print $x_1, x_2, \ldots, x_k$.

Step 7. Set $r = 2$.

Step 8. Replace each of $x_1, x_2, \ldots, x_r$ by $1 + x_r$ and go to step 3.

Step 9. If $n < k x_k$, halt; otherwise go to step 10.

Step 10. Replace $r$ by $r + 1$ and go to step 8.
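One way to clock the outline through is to transcribe it step for step. The Python sketch below does so under stated assumptions: it reads step 2 as setting every part to 1, step 3 as summing all k parts, and step 9 as testing $n < k x_k$; the function name and the restriction to $k \ge 2$ are choices made for the example.

```python
def partitions_into_k_parts(n, k):
    """List the partitions of n into exactly k parts x_1 >= ... >= x_k >= 1."""
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")
    r = 2                            # step 1
    x = [1] * k                      # step 2 (x[0] holds x_1)
    found = []
    while True:
        s = sum(x)                   # step 3
        if n < s:                    # step 4
            if n < k * x[k - 1]:     # step 9: nothing further can fit
                return found
            r += 1                   # step 10
        else:
            x[0] += n - s            # step 5
            found.append(tuple(x))   # step 6 ("print")
            r = 2                    # step 7
        bumped = 1 + x[r - 1]        # step 8: raise x_1, ..., x_r to 1 + x_r
        for i in range(r):
            x[i] = bumped
```

With k = 4 and n = 10 this reading of the outline produces nine partitions, from 7 + 1 + 1 + 1 through 3 + 3 + 2 + 2.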

This outline should be clocked through with specific numbers for
k and n, say k = 4, n = 10. The programming logic here is much more
subtle than in any of our previous examples, and the reader should
convince himself that the routine is correct.

As a final “challenge problem,” the reader is invited to prepare a
program to compute the n! permutations of the integers 1, 2, ..., n where
n < 7. The permutations must be run through once and once only.*


4.7 Conclusion 


In conclusion, we remark that HAC is actually a modified three-address
machine. Most present-day computers are one-address machines
with accumulators. The programming logic, however, is independent
of the machine, although the detailed code very much depends on
the machine. In fact, to become a skillful programmer and coder, one
must know a particular machine intimately.


SEQUENCE OF CODING PROBLEMS 


4.8 Introduction 


We now assume a familiarity with the basic concepts of programming
and coding. We sketch a sequence of problems for programming on any
general-purpose computer, which will develop some basic elementary
knowledge. The approach is almost historical. The instructor will
have to clothe the skeleton we provide according to the equipment he
has available and to emphasize or deemphasize sections according to his
audience.


4.9 Memory Summation 

The most naive program to compute $\sum_{a=n}^{m} (a)$, disregarding overflow,
will include a tally which counts from 1 to $m - n + 1$, in some units:

0. Clear counter for partial sum, clear tally.

1. [Add a new term to the partial sum.]

2. Compare tally with limit: if less, carry on; if equal, go to 6.

3. Advance tally.

4. Advance variable instruction (i.e., 1).

5. Go to 1.

6. Print total and stop.

A little thought shows that the variable instruction is changing with the 
tally and that the variable instruction can be used as a tally, being com- 
pared with its ultimate form. Our program can be revised as follows: 

0. Clear counter for partial sum. 

1. [Add a new term to the partial sum. ] 

2. Compare variable instruction with its final form: if less, carry on;
if equal, go to 5.

3. Advance variable instruction.

4. Go to 1.

5. Print total and stop.

Comparison of these two programs will show that the latter is more
compact and that it saves one addition per cycle; instead of the cycle
1 → 2 → 3 → 4 → 5 → 1, we have 1 → 2 → 3 → 4 → 1.

It is instructive to make time estimates for problems as early as possible and to 
check these whenever the problems are run. In this way a body of knowledge will 
be accumulated which can be applied directly in new situations, or used as a basis 
for extrapolation to them. We do not return to this point explicitly again, but it 
is relevant in practically all the problems of our sequence. 

The sequence of instructions “obey, compare, advance” is one which
is always occurring; it is usually safer to prefix this sequence by another
one, setting the variable instruction to its initial state. Our program
would then read:

0. Clear counter for partial sum.

1. (“Set”) Set variable instruction to its initial state.

2. (“obey”) [Add a new term to the partial sum.]

3. (“compare”) Compare variable instruction with its final form; if
less, carry on; if equal, go to 6.

4. (“advance”) Advance variable instruction.

5. Go to 2.

6. Print total and stop.
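The “set, obey, compare, advance” pattern can be imitated in a language without self-modifying instructions by letting an address variable play the part of the variable instruction. The Python sketch below is such an imitation, with the list `memory` and the function name chosen for the example; the address serves as its own tally, as in the revised program.

```python
def memory_sum(memory, first, last):
    """Sum memory[first..last] inclusive with the set/obey/compare/advance loop."""
    total = 0                      # 0. clear counter for partial sum
    address = first                # 1. set: variable instruction in initial state
    while True:
        total += memory[address]   # 2. obey: add a new term
        if address == last:        # 3. compare with the final form
            return total           # 6. "print" total and stop
        address += 1               # 4. advance (5. go to 2)
```

No separate tally is kept: the comparison in step 3 is made on the address itself.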

We can elaborate this into an equipment test program of a form which
was very necessary in the early days, when deterioration of the storage
presented problems. We imagine the following situation: most of the
storage $\alpha \le n \le \beta$ is filled in any manner, and the rest is to contain a
test program with the following characteristics. Compute and record
the sum $\sum_{n=\alpha}^{\beta} (n)$; compute again and compare with the recorded sum; if
there is a discrepancy, print out both sums and stop; if there is agreement,
recompute, recompare, etc. If the machine is functioning perfectly, no
print-out will be observed; the same may be true if the machine is
functioning very imperfectly. It is therefore desirable to modify our
specification by requiring a periodic print-out of, for instance, the latest
computed sum and the number of summations carried out. The actual
frequency of print-out will be determined by the speed of the machine,
and the duration of the test will be specified by the total number of
summations to be carried out. Suppose that we require $N_0$ summations
altogether and that the latest sum is printed out each time $N$ is a
multiple of 100.

[Fig. 4.1 Memory sum test program. Flow diagram; legible box labels:
“Set N = 0, set $N_0$”; “Set variable instruction to initial state”;
“Add (n + 1)st term to partial sum”; “Advance variable instruction”;
a branch on N ≢ 0 (100) or N ≡ 0 (100); “Print both sums and N; stop.”
The variable instruction is “Add (n + 1)st term to partial sum”; the
setting of this to its initial state, i.e., “Add ($\alpha$) to partial sum,”
should be accompanied by the clearing of the cell reserved for the
partial sum.]

Instead of the linear arrangement used for earlier programs, we now 
indicate the program by a “flow diagram.” 

Many elaborate schemes for flow diagrams, bristling with conventions,
are in use. In the present account we use only the most obvious
conventions and do not attempt to give any formal description. We note that
a flow diagram is largely independent of the machine contemplated for
the solution of the problem. Further, the preparation of a flow
diagram, or something equivalent, is an essential preliminary to the
programming of problems of significant size; such diagrams can be used for
subcontracting the work and especially for the detailed checking of
programs prepared by others (see Burks, Goldstine, and von Neumann
[38]).

The usefulness of memory summation is underlined by the fact that 
machines have been built which have such a summation as a basic in- 
struction. This fact raises the general question of the economics of sim- 
plifying programming by complicating equipment. It is not feasible to 
discuss this question here. 

Let us return, for simplicity, to the naive inductive program. It is 
certainly compact, but it is much more time-consuming than a simple 
linear program of the form: 

0. Clear register for partial sum.

1. Add (n).

2. Add (n + 1).

. . .

(m − n + 1). Add (m).

(m − n + 2). Print total and stop.

The heartbreaking sequence of “compare, advance, obey” can be
avoided at the expense of more complicated equipment. One way is to
use an “index register,” or “B register,” a simple form of which will now
be described. This is a special register in which the number of iterations
required is set initially and is then decreased after each iteration is carried
out. The equipment is arranged so that, as long as the index register is
positive, one action is taken, whereas if it becomes zero, another action is
taken. This avoids the repeated comparison order, at the expense of a
single setting of the index register. The structure of the program in this
context would be of the form:

0. Clear register for partial sum.

1. Set index register.

2. [Add a new term to the partial sum.]

3. Decrease index register: if positive, carry on; if zero, go to 6.

4. Advance variable instruction (i.e., 2).

5. Go to 2.

6. Print total and stop.

Note that here the summation is carried out backwards.

Other uses of the index register are discussed in Sec. 4.21.




4.10 Square Root 


We now discuss the evaluation of square roots. Some machines have
a single instruction which produces the square root; this is rather
unusual, as this facility requires additional equipment. We therefore
consider the determination of square roots by programming. Several
algorithms can be considered. For instance, the elementary-school
method, indicated below, can be readily mechanized.

             7. 0  7  1  0  6
            50.00 00 00 00 00
        7   49
      140    1 00 00
     1407      98 49
    14141       1 51 00
                1 41 41
   141420          9 59 00 00
  1414206


A little thought will show that this is a comparatively slow process and
that the iterative processes discussed in Chap. 2 are to be preferred. We
recall the results of the analysis of the scheme

$$x_0 = 1, \qquad x_{n+1} = \tfrac12\,(x_n + N x_n^{-1}).$$

The sequence $\{x_n\}$ decreases steadily to $\sqrt{N}$ for any $N$, $0 < N < 1$, and
the convergence is quadratic.

Some of the points which require consideration have already been
discussed. A possible program is indicated by the flow diagram alongside.

[Fig. 4.2 Square-root program. Flow diagram; legible box labels:
“Print N; stop”; “Set $x_i = x_0 = 1$”; “$x_{i+1} = \tfrac12 (x_i + N x_i^{-1})$”;
“Print $x_{i+1}$; exit.”]

The check that $N > 0$ is to prevent nonsense from developing.

Note that, if we have a machine which recognizes only numbers
strictly between −1 and +1, then we cannot set $x_0 = 1$ but must use
$x_0 = 1 - e$, where $e$ is the smallest positive number recognized by the
machine. Again, in order to avoid overflow, we must compute

$$x_{n+1} = \tfrac12 x_n + (\tfrac12 N)\,x_n^{-1},$$

and even then we must be sure that in case, for example, $N = 1 - e$
we do not get an overflow on the addition; moreover, we must be sure
that the division is invariably proper, that is, that $0 < x_n > \tfrac12 N$. The
detailed analysis of this process is quite complicated and depends vitally
on the detailed behavior of the arithmetic unit. For the discussion of
special cases, we refer to the work of Householder and Rumsey cited
earlier. We note that, although the recurrence relation is theoretically
stationary only if $x_n = \sqrt{N}$, an efficient practical version of the algorithm
can give the exact result when $N$ is a perfect square.

We conclude with these remarks. If one does not make a complete
analysis of the program, a rough estimate of the error can be made, and
after the alleged square root is obtained, the difference $|x_n^2 - N|$ can be
compared with a specified tolerance and the calculation can be stopped
if the error is excessive.
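The iteration and the closing residual check can be sketched as follows. This Python function is an illustration under stated assumptions: floating-point numbers replace fixed-point words, the names `e` and `tol` and their default values are chosen for the example, and the loop stops when the sequence ceases to decrease rather than after a precomputed number of steps.

```python
def sqrt_by_iteration(n, e=1e-7, tol=1e-6):
    """Approximate sqrt(n) for 0 < n < 1 by x <- x/2 + (n/2)/x, starting at 1 - e."""
    if not 0.0 < n < 1.0:
        raise ValueError("this sketch assumes 0 < N < 1")
    x = 1.0 - e                            # x_0 = 1 - e, not 1
    while True:
        x_next = 0.5 * x + (0.5 * n) / x   # overflow-safe form of the recurrence
        if x_next >= x:                    # sequence has stopped decreasing
            break
        x = x_next
    if abs(x * x - n) > tol:               # final check on the residual
        raise ArithmeticError("residual exceeds the specified tolerance")
    return x
```

The halved operands keep every intermediate quantity below 1 in absolute value whenever $x_n > \tfrac12 N$, which is the condition discussed above.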

It is clear that a program of the kind described is a finite one; it is
important that its length be estimated. In view of what happens for N = 0,
it might be worthwhile to separate out this case.


It is also appropriate to discuss what is a reasonable precision for $\sqrt{N}$,
it being assumed that $N$ is a rounded number. The schoolroom
algorithm indicated that $\sqrt{N}$ is determined to about half the number of
significant figures of $N$; this is confirmed by noting that, if $f(x) = \sqrt{x}$,
then

$$\delta f = \tfrac12\,x^{-1/2}\,\delta x.$$


4.11 arctan d, |d| < 1 


We discuss the calculation of arctan d, as representative of the
calculation of the elementary transcendental functions. Among the competing
algorithms are the following:

1. The Gregory series

$$\arctan x = x - \tfrac13 x^3 + \tfrac15 x^5 - \cdots. \qquad (4.1)$$

2. The continued fraction

$$\arctan x = \cfrac{x}{1 + \cfrac{x^2}{3 + \cfrac{4x^2}{5 + \cfrac{9x^2}{7 + \cdots}}}}$$

(see Teichroew [39]).

3. If $a_0 = 1$, $b_0 = \sqrt{1 + x^2}$ and if for $n = 0, 1, 2, \ldots$,

$$a_{n+1} = \tfrac12\,(a_n + b_n), \qquad b_{n+1} = \sqrt{b_n a_{n+1}},$$

we have

$$\lim a_n = \lim b_n = x/\arctan x.$$

4. Chebyshev expansion (see Chap. 3, Hastings [40], Clenshaw
[12], NPL 5).
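The third algorithm is short enough to try directly. The Python sketch below assumes the starting values $a_0 = 1$, $b_0 = \sqrt{1 + x^2}$ and the pair of recurrences $a_{n+1} = \tfrac12(a_n + b_n)$, $b_{n+1} = \sqrt{b_n a_{n+1}}$; the function name and the fixed iteration count are choices made for the example.

```python
import math

def arctan_by_means(x, iterations=40):
    """Approximate arctan x from the common limit x / arctan x of a_n and b_n."""
    a, b = 1.0, math.sqrt(1.0 + x * x)
    for _ in range(iterations):
        a = 0.5 * (a + b)          # arithmetic mean
        b = math.sqrt(b * a)       # geometric mean with the new a
    return x / b
```

The gap between the two sequences shrinks by a factor of about 4 per step, so the convergence is linear, not quadratic; forty steps are far more than double precision needs.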

We shall not make a comparative evaluation of these but shall discuss
the simple Gregory series method in some detail. It is known that the
series (4.1) converges for $|x| < 1$, but very slowly when $|x|$ is near unity.
One method of improving this is to choose a central value $p$ of
$\theta = \arctan x$ and compute

$$\theta - p = \arctan z, \qquad z = \frac{x - \tan p}{1 + x \tan p}.$$

We shall assume $0 \le x \le 1$. Then $z$ varies from $-\tan p$ to $(1 - \tan p)/
(1 + \tan p)$, and it would appear advantageous to choose $p$ so that the $z$
interval is symmetrical about the origin. This means that $\tau = \tan p$
must satisfy the quadratic equation $\tau^2 + 2\tau - 1 = 0$, which gives

$$\tau = \sqrt{2} - 1 = .41421\ 35624, \qquad p = \arctan \tau = .39269\ 90817.$$


An unsophisticated program will have the following form:

1. Compute $x = |d|$, $z$, $-z^2$; put $\Sigma_0 = z = T_0$.

2. Compute $T_{i+1} = -z^2 (2i + 1) T_i/(2i + 3)$.

3. If $T_{i+1} = 0$, go to 7; if $T_{i+1} \ne 0$, go to 4.

4. Compute the partial sum $\Sigma_{i+1} = \Sigma_i + T_{i+1}$.

5. Advance index, $2i + 3 \to 2i + 1$.

6. Go to 2.

7. Compute $\theta = p + \Sigma_i$.

8. Compute $\arctan d = (\operatorname{sign} d)\,\theta$.
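The eight-step program can be sketched in Python as follows. Two departures from the outline are assumptions of this sketch: a small tolerance `tol` replaces the exact test for a zero term, which on a fixed-point machine is reached by underflow, and $p$ is taken as $\pi/8$, the value $.39269\ 90817$ quoted above.

```python
import math

TAU = math.sqrt(2.0) - 1.0     # tan p
P = math.pi / 8.0              # p = arctan(TAU)

def arctan_gregory(d, tol=1e-12):
    """Approximate arctan d for |d| <= 1 by the Gregory series in z."""
    x = abs(d)                                   # step 1
    z = (x - TAU) / (1.0 + x * TAU)
    minus_z2 = -z * z
    term = total = z                             # T_0 and Sigma_0
    i = 0
    while True:
        term = minus_z2 * (2 * i + 1) * term / (2 * i + 3)   # step 2
        if abs(term) < tol:                      # step 3 (tolerance test)
            break
        total += term                            # step 4
        i += 1                                   # steps 5 and 6
    theta = P + total                            # step 7
    return math.copysign(theta, d)               # step 8
```

Because $|z| \le \sqrt{2} - 1$ on the whole interval, the terms fall off at least as fast as $(0.4142)^{2i+1}$, in contrast to the slow convergence of (4.1) near $|x| = 1$.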


4.12 The Monte Carlo Method—Motivation 


When the exact answer to a statistical or physical or mathematical
problem appears inaccessible, a statistical answer may be acceptable.
Such an answer, ideally, is qualified by some such statement as that the
probability of its being in error by an amount $\epsilon$ is less than $\delta$; very often
this qualification is missing or vague, and the poser is content with the
fact that an apparently satisfactory model of his problem is being used.

Let us consider, for orientation, a problem which can be readily solved
analytically. Two gamblers begin with $z$ and $(a - z)$ dollars; they toss
a coin repeatedly, each time for a stake of 1 dollar, and we assume that
the first player has a probability $p$ of winning and that the second has a
probability $q$ of winning, where $p + q = 1$. We want to know the
probability that the first player will be ruined (i.e., lose all his capital), and
the expected number of tosses before this happens. The results are:


    Condition       Probability of ruin                          Duration

    $p = q = \tfrac12$     $1 - z/a$                                    $z(a - z)$

    $p \ne q$         $\dfrac{(q/p)^a - (q/p)^z}{(q/p)^a - 1}$     $\dfrac{z}{q - p} - \dfrac{a}{q - p} \cdot \dfrac{1 - (q/p)^z}{1 - (q/p)^a}$
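The two rows of the table translate directly into code, which is convenient for checking Monte Carlo results later. The Python functions below are a sketch; their names are assumptions of the example, and the formulas are the classical ones quoted in the table.

```python
def ruin_probability(p, z, a):
    """Probability that the first player, starting with z of the a dollars, is ruined."""
    q = 1.0 - p
    if p == q:
        return 1.0 - z / a
    ratio = q / p
    return (ratio ** a - ratio ** z) / (ratio ** a - 1.0)

def expected_duration(p, z, a):
    """Expected number of tosses until one of the players is ruined."""
    q = 1.0 - p
    if p == q:
        return float(z * (a - z))
    ratio = q / p
    return z / (q - p) - (a / (q - p)) * (1.0 - ratio ** z) / (1.0 - ratio ** a)
```

For a fair coin with a = 10 and z = 5 these give a ruin probability of one half and an expected duration of 25 tosses.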


For a discussion of this, see Feller [11]. We call a game a sequence of
tosses which leads to the ruin of one or the other player. We consider
playing a large number of games, noting the relative frequency P of wins
by the first player and the average number of tosses in a game. These
provide approximations to the quantities desired in virtue of the law of
large numbers for Bernoulli trials. It is clear that these approximations
can be obtained conveniently on a computer, provided we have some
sort of “chance device” in our computer: an instruction which can cause
one action with probability $p$ and another with probability $q$. Machines
have been constructed with such instructions, but the use of genuine
chance elements is not entirely satisfactory (e.g., we cannot repeat
calculations for checking purposes). The construction of deterministic
arithmetic processes which have a suitable behavior has been studied
experimentally with considerable success.

In the present context it will be enough if we can produce
“pseudo-random numbers” which are “uniformly distributed” in [0,1] and make
our discrimination according to whether the current pseudo-random
number is $< p$ or $> p$. What we want is a sequence of numbers $r_n$,
$0 \le r_n \le 1$ such that, if $N_n$ is the number of $r_1, r_2, \ldots, r_n$ contained in
any interval of length $\alpha$ included in [0,1], then $N_n/n$ is approximately $\alpha$.
In the next section we discuss one way of generating and testing such a
sequence.

It is appropriate to quote here the definition given by D. H. Lehmer
[Harvard 26] of a pseudo-random sequence. It is “a vague notion
embodying the idea of a sequence in which each term is unpredictable
to the uninitiated and whose digits pass a certain number of tests,
traditional with statisticians and depending somewhat on the uses to
which the sequence is to be put.”

For a survey of this subject, with an extensive bibliography, up to 
1954, see Meyer [52]; see also NBSAMS 12. Among the more 
recent papers are Bauer [15], Davis and Rabinowitz [16], and 
Hammersley [51]. A report on applications in nuclear physics has 
been written by Fortet [56]; considerable work in this area has been 
carried out by Richtmyer. 


4.13 Generation and Testing of Pseudo-random Numbers 


We begin with a scheme suitable for a decimal machine, with 10D.
We write

$$\bar r_0 = 1, \qquad \bar r_{n+1} = \rho\,\bar r_n \pmod{10^{10}}, \qquad \rho = 7^9 = 0040353607,$$

and then take

$$r_n = 10^{-10}\,\bar r_n.$$

The barred quantities are integers, and $\bar r_{n+1}$ is got by taking the last 10
digits of $\rho \bar r_n$, that is, disregarding multiples of $10^{10}$. It can be shown that
the sequences $\{r_n\}$, $\{\bar r_n\}$ each have period exactly $5 \cdot 10^7$.

The actual generation of the sequence $r_n$ is trivial on a machine. We
take $r_{n+1}$ as the less significant half of the double-length product of $r_n$ and
$\rho \cdot 10^{-10}$.
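The generator is a few lines in a language with long integers. The Python sketch below is an illustration; the names `MODULUS`, `RHO` and `pseudo_random` are assumptions of the example. The built-in three-argument `pow` also allows the period to be checked without running through the whole sequence.

```python
MODULUS = 10 ** 10
RHO = 7 ** 9                       # 40353607

def pseudo_random(count, seed=1):
    """Return `count` numbers r_n = r_bar_n / 10^10 from the multiplicative scheme."""
    r_bar = seed
    numbers = []
    for _ in range(count):
        r_bar = (RHO * r_bar) % MODULUS    # keep the last ten decimal digits
        numbers.append(r_bar / MODULUS)
    return numbers

# The sequence returns to its start after 5 * 10^7 steps and not sooner:
PERIOD = 5 * 10 ** 7
period_holds = (pow(RHO, PERIOD, MODULUS) == 1
                and pow(RHO, PERIOD // 2, MODULUS) != 1
                and pow(RHO, PERIOD // 5, MODULUS) != 1)
```

Since the period must divide $2^7 \cdot 5^8$ once the first condition holds, the two extra conditions rule out every proper divisor.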
The determination of the period in the decimal cases is rather involved.
We digress to give the discussion in a binary case. We shall show that
the sequence defined by $x_0 = 1$, $x_{n+1} = 5x_n \pmod{2^s}$ has period $2^{s-2}$.
The question is, What is the least integer $M$ for which $5^M \equiv 1\ (2^s)$?
Suppose we have $5^M \equiv 1$ and that we have, say, $M = r \cdot 2^S$, where $r$
is odd. Then we can show that $5^{2^S} \equiv 1\ (2^s)$. For

$$5^M - 1 = \left[5^{(r-1)2^S} + 5^{(r-2)2^S} + \cdots + 5^{2^S} + 1\right]\left(5^{2^S} - 1\right).$$

In the first factor, there is an odd number of terms, and each one of them
is odd; so $2^s$ must divide the second factor.

We now restrict ourselves to the case $M = 2^S$ and show that $5^{2^S} - 1$
is divisible by $2^{S+2}$ and by no higher power of 2. We observe that

$$5^{2^S} - 1 = \left(5^{2^{S-1}} - 1\right)\left(5^{2^{S-1}} + 1\right)$$

$$= \left(5^{2^{S-2}} - 1\right)\left(5^{2^{S-2}} + 1\right)\left(5^{2^{S-1}} + 1\right)$$

$$= (5 - 1)(5 + 1)(5^2 + 1)(5^4 + 1) \cdots \left(5^{2^{S-1}} + 1\right).$$

The first factor is $2^2$, and the highest power of 2 that divides each of the
remaining $S$ factors is the first.

Thus, the least $M$ for which $5^M \equiv 1\ (2^s)$ is $2^{s-2}$.
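For small word lengths the result can be confirmed by brute force. The following Python sketch counts the steps until the multiplicative sequence returns to 1; the function name is an assumption of the example, and the loop terminates only when the multiplier is prime to the modulus.

```python
def multiplicative_period(multiplier, modulus):
    """Least M >= 1 with multiplier**M congruent to 1 modulo `modulus`."""
    x, steps = multiplier % modulus, 1
    while x != 1:
        x = (x * multiplier) % modulus
        steps += 1
    return steps

# Period of x_{n+1} = 5 x_n (mod 2^s) for s = 3, ..., 14.
periods = {s: multiplicative_period(5, 2 ** s) for s in range(3, 15)}
```

Every entry of `periods` equals $2^{s-2}$, in agreement with the argument above.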

In the case of current machines the word length is some 30 to 40 bits;
this means that the period is of the order of $10^8$ to $10^{11}$.

We now discuss a simple frequency test for these pseudo-random
numbers. We propose to generate the first thousand of these and to record
the number which lie in each of the intervals [0,.1], [.1,.2], ..., [.9,1].
We generate the next thousand and record the frequencies, and so on.
The significance of the lack of uniformity in the distribution can be
evaluated, for instance, by using the $\chi^2$ test.

A convenient way of handling this is to record the count in the $i$th
interval in cell $m + i$ and then to modify a dummy instruction, “Advance
the count in cell m,” by adding to it the first digit (suitably shifted) of the
current pseudo-random number to get the appropriate instruction.

In a test the observed frequencies were 111, 95, 95, 101, 96, 94, 105,
109, 106, 88. This result gives a value of $\chi^2 = 5.10$, whereas for 9
degrees of freedom a significant departure from the expectation, at the
5 per cent level of significance, occurs if $\chi^2 > 16.9$ or if $\chi^2 < 3.3$.
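The quoted statistic is easily reproduced. In the Python sketch below the function name is an assumption of the example, and the expected count in each interval is taken as the total divided by the number of intervals.

```python
def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

observed = [111, 95, 95, 101, 96, 94, 105, 109, 106, 88]
statistic = chi_square_uniform(observed)   # sum of squared deviations is 510
```

With 1000 numbers in 10 intervals each expected count is 100, the squared deviations sum to 510, and the statistic is 5.10, inside the band from 3.3 to 16.9.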

For a more detailed account of this method and others, and their 
testing, together with detailed references, see Taussky and Todd [8]. 


4.14 Gambler’s-ruin Problem 


We now sketch a program for the “solution” of the gambler’s-ruin
problem by the Monte Carlo method. We consider playing sequences
of 1000 games, up to a total of $N^*$; we denote by $N$ the number of games
played, by $S$ the current total number of tosses, and by $L$ the number of
times the first player has been ruined. We shall print out $N$, $L/N$, $S/N$,
and $r$, the current pseudo-random number, after each 1000 games.

0. Clear counters $N$, $L$, $S$.

1. Set data $N^*$, $r$, $p$, $a$.

2. Set datum $z$.

3. Generate a random number.

4. Advance $S$.

5. If $r_n < p$, go to 6; if $r_n > p$, go to 13.

6. Advance $z$.

7. If $z = a$, go to 8; if $z \ne a$, go to 3.

8. Advance $N$.

9. If $N \equiv 0\ (1000)$, go to 10; if $N \not\equiv 0\ (1000)$, go to 2.

10. Print $N$, $L/N$, $S/N$, $r$.

11. If $N = N^*$, go to 12; if $N \ne N^*$, go to 2.

12. Stop.

13. Decrease $z$.

14. If $z = 0$, go to 15; if $z \ne 0$, go to 3.

15. Advance $N$ and $L$.

16. Go to 9.
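The flow of steps 0 to 16 can be condensed into two nested loops. The Python sketch below is an illustration under stated assumptions: it draws from the standard library's `random.Random` rather than from the multiplicative generator of Sec. 4.13, it returns the two averages once at the end instead of printing every 1000 games, and the function name is invented for the example.

```python
import random

def simulate_ruin(p, z0, a, games, rng):
    """Play `games` games; return (L/N, S/N) for starting capital z0 out of a."""
    ruined = tosses = 0                  # counters L and S
    for _ in range(games):               # N counts completed games
        z = z0                           # step 2
        while 0 < z < a:
            tosses += 1                  # step 4
            if rng.random() < p:         # steps 3 and 5
                z += 1                   # step 6
            else:
                z -= 1                   # step 13
        if z == 0:
            ruined += 1                  # step 15
    return ruined / games, tosses / games

ruin_frequency, mean_tosses = simulate_ruin(0.5, 5, 10, 2000, random.Random(12345))
```

For a fair coin with a = 10 and z = 5 the averages should fall near the theoretical values one half and 25, with a scatter that shrinks like the inverse square root of the number of games.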

The following are the results of some experiments:

    p = [illegible],  a = 10,  z = 5,  N = 1000,  L/N = .515,  S/N = 24.
    p = [illegible],  a = 10,  z = 6,  N = 1000,  L/N = .422,  S/N = 25.
    p = .55,          a = 10,  z = 5,  N = 1000,  L/N = .272,  S/N = 25.


Comparison of these results with the theoretical results gives us addi- 
tional confidence in the quality of our pseudo-random numbers. 


4.15 Normal Deviates 


In some problems it may be necessary to have available random
numbers from a normal distribution, rather than from a uniform one. A
practically convenient method for doing this makes use of the
central-limit theorem: if $x_1, x_2, \ldots, x_k$ are chosen independently from a
standard uniform distribution, then the distribution of the mean
$\bar x = (x_1 + \cdots + x_k)/k$ approaches the normal distribution as $k \to \infty$.
In practice, it is found that the approximation given when $k = 10$ is
sufficient for many purposes. For other schemes, see, for example, Votaw and
Rafferty [19].

It is suggested that this scheme be tested by generating 1000 pseudo- 
normal deviates by adding pseudo-random numbers in groups of 10 and 
obtaining their frequency distribution. 
Two such tests gave the following counts:


0, 0, 13, 117, 360, 380, 117, 13, 0, 0
0, 0, 12, 127, 355, 368, 125, 13, 0, 0.
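The suggested test can be repeated as follows. This Python sketch makes two assumptions: the standard library generator supplies the uniform numbers, and the ten counts refer to the intervals [0,.1], [.1,.2], ..., [.9,1] for the mean of each group of 10, which is consistent with the counts quoted above. The function names are invented for the example.

```python
import random

def pseudo_normal_means(count, k=10, rng=None):
    """Return `count` means of k uniform numbers each."""
    rng = rng or random.Random(0)
    return [sum(rng.random() for _ in range(k)) / k for _ in range(count)]

def histogram(values, bins=10):
    """Count values of [0, 1] falling in `bins` equal intervals."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts

means = pseudo_normal_means(1000)
counts = histogram(means)
```

The mean of 10 uniform numbers has expectation 1/2 and variance 1/120, so nearly all of the 1000 values should land in the four central intervals.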


4.16 Youden’s Problem 


The following problem was proposed by W. J. Youden. A sample
of $n$ numbers $x_1, x_2, \ldots, x_n$ is drawn from a normal distribution; in
practice, the cases $n = 4, 5, \ldots, 10$ are of interest. Suppose the
numbers have been labeled so that $x_1 \le x_2 \le \cdots \le x_n$. Let $\bar x = (x_1 + x_2 +
\cdots + x_n)/n$. It is required to find the probabilities

$$p_k^{(n)} = \Pr\{x_k \le \bar x \le x_{k+1}\}, \qquad k = 1, 2, \ldots, n - 1.$$

Since the problem was raised, an explicit solution was found by H. T.
Davis in the case $n = 4$:

$$p_2^{(4)} = 6\left[\arccos\left(-\tfrac13\right) - \tfrac12 \pi\right]/\pi = .649, \qquad p_1^{(4)} = \tfrac12\left(1 - p_2^{(4)}\right) = p_3^{(4)}.$$

The devices which we now possess enable these probabilities to be
estimated experimentally. Take the case $n = 4$. We generate groups
of four normal deviates by averaging sets of 40 pseudo-random numbers.
We arrange the sets of four in order and observe whether the mean lies
in the center or in the outer intervals. We repeat this process and
observe the relative frequencies of the three events.
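The experiment can be sketched in Python as follows. This is an illustration under stated assumptions: `random.Random.gauss` supplies the normal deviates directly instead of averaged uniform numbers, and the function name and sample count are choices made for the example.

```python
import random

def youden_frequencies(n, samples, rng):
    """Relative frequency with which the sample mean falls between the kth and
    (k+1)th ordered values, for k = 1, ..., n - 1."""
    counts = [0] * (n - 1)
    for _ in range(samples):
        xs = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        mean = sum(xs) / n
        k = sum(1 for x in xs if x <= mean)    # number of values not above the mean
        k = min(max(k, 1), n - 1)              # guard against ties at the ends
        counts[k - 1] += 1
    return [c / samples for c in counts]

frequencies = youden_frequencies(4, 4000, random.Random(2024))
```

For n = 4 the middle frequency can be compared with the explicit value .649, and the two outer ones with each other.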

The following estimates were obtained by I. A. Stegun.


    No. of samples   n    $p_1^{(n)}$   $p_2^{(n)}$   $p_3^{(n)}$   $p_4^{(n)}$   $p_5^{(n)}$   $p_6^{(n)}$
    7000             4    .17     .65     .17
    7000             5    .048    .448    .457    .048
    7000             6    .012    .220    .535    .223    .011
    5000             7    .0028   .087    .408    .406    .092    .0036


For a discussion of another, similar, problem, see Scheid [37]. 


4.17 The Dirichlet Problem 


The two problems discussed so far—the gambler’s-ruin problem and 
the Youden problem—have been statistical, and the Monte Carlo treat- 
ment was evident. It is, however, possible to embed a mathematical 
problem in a statistical framework and then solve it by a Monte Carlo 
method. We discuss a simple case of the classical Dirichlet problem. 

Find $V = V(x,y)$, given that $V_{xx} + V_{yy} = 0$ for $0 < x < 1$, $0 < y < 1$
and

$$V(x,1) = \sin \pi x, \quad V(x,0) = 0, \qquad 0 \le x \le 1,$$
$$V(0,y) = 0, \quad V(1,y) = 0, \qquad 0 \le y \le 1.$$

We shall obtain an estimate, not of $V$, but of the solution $u$ to a
corresponding discrete problem. Consider a lattice on the unit square with
sides $h = (l + 1)^{-1}$, for some integer $l$. We replace the differential
equation for $V$ by a partial difference equation for $u(m,n) = u(mh,nh)$:


$$u(m-1,n) + u(m,n-1) - 4u(m,n) + u(m,n+1) + u(m+1,n) = 0, \qquad m, n = 1, 2, \ldots, l,$$
$$u(m,l+1) = \sin \pi m h, \quad u(m,0) = 0, \qquad m = 0, 1, \ldots, l+1,$$
$$u(0,n) = 0, \quad u(l+1,n) = 0, \qquad n = 0, 1, \ldots, l+1.$$


In this simple case, both $u$ and $V$ are known explicitly:


$$V(x,y) = \sin \pi x \, \sinh \pi y / \sinh \pi,$$
$$u(m,n) = \sin \pi m h \, \sinh \lambda \pi n h / \sinh \lambda \pi,$$


where $\lambda$ is given by

$$\sinh \tfrac{1}{2}\lambda \pi h = \sin \tfrac{1}{2}\pi h,$$


which gives $\lambda = .9968$ for $l = 15$.

An estimate for $u(P_0)$, where $P_0$ is an interior lattice point, can be
obtained as follows. Imagine a particle beginning at $P = P_0$ of the lattice
and continuing in the following way. If at any time it is at $P(\alpha,\beta)$, then
it moves at the next instant to $(\alpha, \beta - h)$, $(\alpha - h, \beta)$, $(\alpha + h, \beta)$, $(\alpha, \beta + h)$,
each with probability $\tfrac14$. When it reaches a boundary point $Q$, a score
$u(Q)$ is made. The process is then repeated, starting at $P_0$. The problem
is, What is the expected value of the score? It can be shown that
the expected value of the score is $u(P_0)$.

In practice, this is estimated as the arithmetic mean of the score after
$N$ walks, as $N \to \infty$. The dispersion of this mean has been examined;
in order to get $m$ decimal places correct, about $4 \times 10^{2m}$ walks are
needed (see Curtiss [18]).

Granted a source of pseudo-random numbers from a uniform distribution,
the above process can be readily carried out on a computer. It
is not essentially different from the gambler's-ruin problem discussed
earlier. The choice of the direction of a step is determined by finding in
which of the intervals


$[0,\tfrac14)$, $[\tfrac14,\tfrac12)$, $[\tfrac12,\tfrac34)$, $[\tfrac34,1]$


the current random number lies. The appropriate counter (one for the 
x coordinate and one for the y coordinate) is adjusted, and it is deter- 
mined whether a boundary has been reached. 
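
In place of a flow diagram, the walk can be sketched in a few lines. This is an illustrative sketch only: the function names, the seed, and the order in which the four intervals are assigned to directions are assumptions, and `lattice_solution` uses the explicit formula for $u$ with $\lambda$ obtained from $\sinh \tfrac12 \lambda \pi h = \sin \tfrac12 \pi h$.

```python
import math
import random

def lattice_solution(m, n, l=15):
    """Explicit solution u(m, n) of the difference equation on the (l+1) x (l+1) mesh."""
    h = 1.0 / (l + 1)
    lam = 2.0 * math.asinh(math.sin(0.5 * math.pi * h)) / (math.pi * h)
    return (math.sin(math.pi * m * h) * math.sinh(lam * math.pi * n * h)
            / math.sinh(lam * math.pi))

def walk_estimate(m0, n0, walks, l=15, seed=0):
    """Mean score of random walks started at lattice point (m0, n0)."""
    rng = random.Random(seed)
    h = 1.0 / (l + 1)
    total = 0.0
    for _ in range(walks):
        m, n = m0, n0
        while 0 < m < l + 1 and 0 < n < l + 1:
            r = rng.random()
            if r < 0.25:
                n -= 1          # step down
            elif r < 0.5:
                m -= 1          # step left
            elif r < 0.75:
                m += 1          # step right
            else:
                n += 1          # step up
        if n == l + 1:          # only the side y = 1 carries a nonzero score
            total += math.sin(math.pi * m * h)
    return total / walks

if __name__ == "__main__":
    print(lattice_solution(8, 8), walk_estimate(8, 8, 4000))
```

The start point `(8, 8)` on the $16 \times 16$ lattice corresponds to $x = y = \tfrac12$.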

The construction of a flow diagram is left to the reader. There
follows a summary of results of some experiments, carried out on a binary
machine, with walks on a $16 \times 16$ lattice (see Todd [3]).


x = 1/2, y = 1/2    Differential equation        .1993
                    Difference equation          .2002
                    Monte Carlo, 6592 walks      .2014
x = 1/4, y = 3/4    Differential equation        .3201
                    Difference equation          .3209
                    Monte Carlo, 2176 walks      .3001
x = 1/4, y = 1/4    Differential equation        .0532
                    Difference equation          .0534
                    Monte Carlo, 13440 walks     .0530
x = 1/8, y = 1/2    Differential equation        .0763
                    Difference equation          .0766
                    Monte Carlo, 10368 walks     .0807


Among the theoretical studies of this problem, we note those of Wasow 
[53-55]. 


4.18 Polynomials and Polynomial Interpolation 


(a) We have already indicated (Chap. 1) that a simple, efficient
method of evaluating a polynomial


$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$

is by the recurrence relation

$$f_0 = a_n, \qquad f_{i+1} = x f_i + a_{n-i-1}, \qquad i = 0, 1, \ldots, n - 1,$$


which gives

$$f_n = f(x).$$

The construction of a subroutine to evaluate f(x) where f is a poly- 
nomial of fixed degree or of arbitrary degree is formally easy. In the 
latter case the degree can be set as a parameter to stop the induction, or 
a special flag can be put at the end of the list of the a,’s to serve the same 
purpose. In general, however, overflows can occur, and this must not 
be allowed. Either we check at each stage to see whether the addition 
produces an overflow and make appropriate arrangements if it does, or 
we make a preliminary analysis of the problem and make appropriate 
changes of scale at the beginning. For instance, if we want to compute 
f(x) for x in the range (—10,10) on a machine recognizing numbers in 
(—1,1) only, we can proceed as follows. We note that 


$$F(X) = \alpha_n X^n + \alpha_{n-1} X^{n-1} + \cdots + \alpha_1 X + \alpha_0 = f(x)/(\mu \cdot 10^n)$$

if $x = 10X$; $\alpha_i = 10^{i-n} a_i/\mu$, $i = 0, 1, \ldots, n$; $\mu \ge n$.

It can be shown that no overflow can take place in the evaluation of
$F(X)$. If we choose $\mu$ to be a power of 10, then $f(x)$ can be read off by an
adjustment of the decimal point; otherwise we have to multiply (by a
scaled $\mu$) and then adjust the decimal point.
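
The recurrence and the change of scale can be illustrated as follows. This is a sketch under stated assumptions: the coefficients are taken in the order $a_n, \ldots, a_0$, the scaled coefficients are assumed to be $10^{i-n} a_i/\mu$ with $x = 10X$, and the function names and the returned `peak` value (the largest intermediate modulus, which should stay below 1 on the machine contemplated) are hypothetical.

```python
def horner(coeffs, x):
    """Evaluate a_n x^n + ... + a_0 by f_0 = a_n, f_{i+1} = x f_i + a_{n-i-1}."""
    value = coeffs[0]
    for a in coeffs[1:]:
        value = x * value + a
    return value

def scaled_horner(coeffs, x, mu):
    """Evaluate the same polynomial through F(X), X = x/10, keeping track of
    the largest intermediate modulus (assumed scaling, see lead-in)."""
    n = len(coeffs) - 1
    # coeffs[j] is a_{n-j}, so its scaled value is 10**(-j) * a_{n-j} / mu
    alphas = [a * 10.0 ** (-j) / mu for j, a in enumerate(coeffs)]
    X = x / 10.0
    value = alphas[0]
    peak = abs(value)
    for alpha in alphas[1:]:
        value = X * value + alpha
        peak = max(peak, abs(value))
    return value * mu * 10.0 ** n, peak

if __name__ == "__main__":
    print(horner([0.5, -0.25, 0.75, 0.1], 7.0))
    print(scaled_horner([0.5, -0.25, 0.75, 0.1], 7.0, mu=10.0))
```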

(b) We have noted the possibility of polynomial (Lagrangian) interpolation
as a method for approximation of functions given by a table.
Probably the most convenient method of handling this is by use of the 
Aitken algorithm. This is left as an exercise. The subroutine should 
have as a parameter the order of interpolation to be used and the address 
of the first of the given arguments and of the given values, each of which 
may be supposed stored in consecutive cells. 
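
One possible form of the exercise is sketched below. It uses Aitken's scheme of repeated linear interpolation; the function name and the argument conventions (lists in place of consecutive cells, `order` for the order of interpolation) are assumptions made for illustration.

```python
def aitken_interpolate(xs, ys, x, order=None):
    """Interpolate at x by Aitken's repeated linear interpolation, using the
    first order + 1 tabular points (all of them if order is None)."""
    if order is None:
        order = len(xs) - 1
    p = list(ys[: order + 1])
    for k in range(1, order + 1):
        for i in range(k, order + 1):
            # linear interpolation between the entries based on points k-1 and i
            p[i] = ((x - xs[k - 1]) * p[i] - (x - xs[i]) * p[k - 1]) / (xs[i] - xs[k - 1])
    return p[order]

if __name__ == "__main__":
    print(aitken_interpolate([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], 1.5))
```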

(c) A subroutine for the manipulation of differences has many uses. 
It should, in the first place, compute and list the early differences of a 
table of values $f(a_0 + nh)$, $n = 0, 1, 2, \ldots$, stored in consecutive cells.
This listing would be used for a preliminary study of the behavior of the 
function. After this study, it should become clear what is a reasonable 
number of decimals to retain, what is a reasonable number of differences 
to give, and whether or not they should be modified (Chap. 2). The 
subroutine should have an optional entrance to produce a final copy of 
the table, in standard format. 


4.19 Special Devices 


(a) We have noted the power of the Euler process (Chap. 2). It is
useful to have a subroutine which carries out this process on a series, the 
terms of which are either stored in a specific place or generated by a 
specific recurrence relation (see Rosser [34]). 
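
A minimal sketch of such a subroutine, assuming the series has the alternating form $\sum (-1)^k a_k$ and that its terms are supplied as a list, is given below; the function name is hypothetical. It forms repeated differences of the $a_k$ and sums their leading entries with weights $2^{-(n+1)}$.

```python
def euler_sum(terms):
    """Euler transformation of the alternating series sum (-1)^k a_k,
    where terms = [a_0, a_1, ...]."""
    diffs = list(terms)
    total = 0.0
    weight = 0.5
    while diffs:
        total += weight * diffs[0]
        weight *= 0.5
        # next row of differences a_k - a_{k+1}
        diffs = [diffs[i] - diffs[i + 1] for i in range(len(diffs) - 1)]
    return total

if __name__ == "__main__":
    # 1 - 1/2 + 1/3 - ... = log 2
    print(euler_sum([1.0 / (k + 1) for k in range(20)]))
```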

(b) We have noted the power of the Aitken $\delta^2$ process and also its
dangers (Chaps. 1, 2). It is useful to have a subroutine which carries 
out this process (see, e.g., Todd and Warschawski [35], Henrici [43]). 
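
A sketch of the $\delta^2$ process follows. The function names are hypothetical, and the guard against a vanishing second difference is one simple precaution against the dangers mentioned, not a complete safeguard.

```python
def aitken_delta2(x0, x1, x2):
    """One application of the Aitken delta-squared formula to three
    consecutive members of a sequence."""
    denom = x2 - 2.0 * x1 + x0
    if denom == 0.0:
        return x2  # second difference vanishes; no extrapolation attempted
    return x0 - (x1 - x0) ** 2 / denom

def accelerate(sequence):
    """Apply the process to every run of three consecutive terms."""
    return [aitken_delta2(sequence[i], sequence[i + 1], sequence[i + 2])
            for i in range(len(sequence) - 2)]

if __name__ == "__main__":
    print(accelerate([1.0 - 0.5 ** n for n in range(6)]))
```

For a sequence whose error is exactly geometric, as in the usage line, the process returns the limit at once.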

(c) The dangers of using recurrence relations for the generation of 
functions are considerable. We have discussed their use in the genera- 
tion of Bessel functions in detail in Chap. 2. A practical study of this 
scheme will be found rewarding. See Stegun and Abramowitz [32,33] 
and Gautschi [49,50]. 
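
A small experiment of the kind suggested might look as follows. It is a sketch, not the scheme of Chap. 2 in detail: the forward routine uses $J_{n+1} = (2n/x)J_n - J_{n-1}$ from given starting values, the backward routine starts from arbitrary small trial values at an assumed index `start` and normalizes by $J_0 + 2J_2 + 2J_4 + \cdots = 1$, and the function names are hypothetical.

```python
def bessel_forward(x, nmax, j0, j1):
    """J_0(x), ..., J_nmax(x) by forward recurrence from given J_0, J_1."""
    values = [j0, j1]
    for n in range(1, nmax):
        values.append(2.0 * n / x * values[n] - values[n - 1])
    return values

def bessel_backward(x, nmax, start=None):
    """J_0(x), ..., J_nmax(x) by backward recurrence from trial values."""
    if start is None:
        start = nmax + 20
    trial = [0.0] * (start + 2)
    trial[start] = 1.0e-30
    for n in range(start, 0, -1):
        trial[n - 1] = 2.0 * n / x * trial[n] - trial[n + 1]
    norm = trial[0] + 2.0 * sum(trial[2:start + 1:2])
    return [t / norm for t in trial[: nmax + 1]]

if __name__ == "__main__":
    fwd = bessel_forward(1.0, 15, 0.7651976866, 0.4400505857)
    bwd = bessel_backward(1.0, 15)
    print(fwd[15], bwd[15])
```

With starting values rounded to 10 decimals, the forward values for large $n$ are dominated by amplified rounding error, while the backward values remain small and smooth.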


4.20 Sorting 


We discuss, as an elementary type of combinatorial and data-processing
problem, the rearranging of a given set of numbers $a_1, a_2, \ldots, a_N$,
arranged in consecutive cells, in ascending order in the same
cells, that is, as $b_1, b_2, \ldots, b_N$ where $b_1 \le b_2 \le \cdots \le b_N$.

One solution to this problem is the following. Assume that the first $n$
have been arranged in order. Remove $a_{n+1}$ to a temporary position.
We determine whether $a_{n+1} \ge b_n$. If $a_{n+1} \ge b_n$, we replace $a_{n+1}$ as $b_{n+1}$,
and we have now the first $n + 1$ in order. However, if $a_{n+1} < b_n$, we
have to determine the position of $a_{n+1}$ in the sequence $b_1, b_2, \ldots, b_n$.
To do this, we move $b_n$ to the cell occupied by $a_{n+1}$ and rename it $b_{n+1}$.
We then determine whether $a_{n+1} < b_{n-1}$. If $a_{n+1} \ge b_{n-1}$, we put $a_{n+1}$ in
the $n$th cell and call it $b_n$, and we then have the first $n + 1$ in order.
However, if $a_{n+1} < b_{n-1}$, we have to carry on; that is, we have to move
$b_{n-1}$ to the $n$th cell, call it $b_n$, etc. This backing up will end with $a_{n+1}$ as
$b_1$ if $a_{n+1} < b_1$ or with $a_{n+1}$ occupying some intermediate position. This
completes the discussion of the inductive step.

In translating this into an actual program, it may be found convenient
to assign fictitious elements $-\infty$ and $+\infty$ before and after the $a_i$. The successive
stages in the process are indicated in the following example:


−∞    3    0   −2    5    1    ∞
−∞    0    3   −2    5    1    ∞
−∞    0   −2    3    5    1    ∞
−∞   −2    0    3    5    1    ∞
−∞   −2    0    3    1    5    ∞
−∞   −2    0    1    3    5    ∞


The time to carry out such a sort depends on the disorder of the data, and 
a maximum time can easily be calculated. With an efficient code, the 
maximum time for N = 500, on a millisecond machine, will be a few 
minutes. 
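
The inductive step can be sketched as follows. The function name is hypothetical, and a single fictitious element $-\infty$ placed before the data is assumed; it ends the backing up without a separate test for the first cell.

```python
import math

def insertion_sort(cells):
    """Sort by inserting each element into the ordered run that precedes it."""
    work = [-math.inf] + list(cells)      # fictitious element before the data
    for n in range(2, len(work)):
        temp = work[n]                    # remove the next element to a temporary position
        j = n - 1
        while work[j] > temp:             # back up, moving larger elements one cell on
            work[j + 1] = work[j]
            j -= 1
        work[j + 1] = temp
    return work[1:]

if __name__ == "__main__":
    print(insertion_sort([3, 0, -2, 5, 1]))
```

The usage line sorts the data of the example above.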


4.21 Construction of a Program 


The programs that we have discussed are far from typical, but some 
are likely to be of use in building up more representative programs. 
Consider the following two problems: 

1. Obtain the numerical solution of the differential equation 


$$y' = \sqrt{x} + \sqrt{y}, \qquad y(0) = 0,$$


for $x$ satisfying $0 \le x \le 1$.

2. Prepare a table for conversion from rectangular coordinates to polar
coordinates. Specifically, find $r = \sqrt{x^2 + y^2}$, $\theta = \arctan(y/x)$ for
$x = 1(1)y$, $y = 1(1)100$.

Disregarding for the time being the precise method of solution, we ob- 
serve that the square-root and arctan routines prepared earlier can be 
reused. However, it is most likely that the original locations in storage 
will be unsuitable. It is, of course, possible to rewrite the routines in the 
location desired; this will only require adding a constant to various
addresses in the instructions. To do this clerical operation manually
provides an opportunity for error which should and can be avoided by
arranging for the computer itself to do the relocation. We propose to
describe how this can be done.

We imagine a program built up as follows. On the input medium 
(paper tape, magnetic tape or wire, or punched cards) we write the new 
parts of the program and copy mechanically the required subroutines 
from the masters. We arrange for the new program to get into 
storage with the subroutines in the desired places but with wrong 
addresses. 

What we have to do is to design a program which relocates the
subroutines, adding constants to certain instruction words to correct the
addresses and leaving other and, in particular, "arithmetic" words
untouched. This will be possible only if the words to be altered can be
distinguished in some way. How this is done depends greatly on the 
fine structure of the machine, and from now on we shall be necessarily 
rather vague in order to be general. 

It can happen that certain digits in instruction words are superfluous. 
(For instance, original plans for a 10000 word storage may have been 
changed to 1000 words, so that, say, a decimal digit is left free. Insert- 
ing a digit in this position in the words which are to be altered is a satis- 
factory distinction. Again, the sign in an instruction word may not be 
sensed by the control, and the signs can be used to distinguish between 
two classes of instructions.) Let us suppose that all our subroutines 
are written with markers in the instruction words which have to be 
altered. 

For simplicity, let us assume that the subroutines to be relocated are
placed in the memory starting at cells $M_i$, and that they are written as if
they started in 0000, so that $M_1$ has to be added to the addresses of the
marked instructions in the first case, $M_2$ in the second case, and so on.
We can call the list $M_1, M_2, \ldots$ the "directory." With each $M_i$ there
will be listed the number of instructions in the $i$th subroutine which have
to be processed. An outline of a relocating program follows.

1. Set up variable instruction 2. 

2. [Obtain the next word in the directory.] 

3. If all subroutines have been relocated, go to 4; if not, go to 5. 

4. Stop. (We can then start on the main program!) 

5. Isolate the constant $M_i$ and construct (6), a variable instruction
to bring in the fictitious instruction and (8), a variable instruction to 
put out the adjusted instructions. Also, make appropriate prepara- 
tions in (7) and (9). 

6. [Bring in the fictitious instruction. ] 

7. Process the fictitious instruction. 

8. [Replace the adjusted instruction.]

9. If all instructions in the $i$th subroutine have been processed, go to
12; if not, go to 10.

10. Advance addresses in 6 and 8. 

11. Go to 6. 

12. Advance address in 2. 

13. Go to 2. 
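
The outline can be sketched on a toy model of the store. Everything specific here is an assumption made for illustration: each word is modeled as a list `[marked, operation, address]`, the directory as a list of pairs `(M_i, number of instructions)`, and the marker is cleared once a word has been adjusted so that a second pass would leave it alone.

```python
def relocate(memory, directory):
    """Add M_i to the address of every marked word of the i-th subroutine."""
    for start, count in directory:                  # steps 2, 3, 12, 13
        for cell in range(start, start + count):    # steps 6 to 11
            marked, operation, address = memory[cell]
            if marked:                              # step 7: only marked words are altered
                memory[cell] = [False, operation, address + start]
    return memory

if __name__ == "__main__":
    store = {100: [True, "JUMP", 2], 101: [False, "NUMBER", 7], 102: [True, "ADD", 1]}
    print(relocate(store, [(100, 3)]))
```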

The construction of a program of this character should be compulsory. 
For from this have developed massive programs called assemblers or 
compilers. For example, such programs can be organized to search a 
library tape for a specified subroutine, find a space in the main program 
for it, copy it there, and then relocate it in the sense just described. 
There is hardly any limit to the complexity of compilers, but there is 
certainly a point of diminishing returns from investment in their con- 
struction. 

We now take up the question of the use of subroutines, now assumed
in storage, properly addressed. If a subroutine is to be used in only one
part of the main program, though repeatedly, then there is little trouble.
The structure of the square-root subroutine is as follows: if the control is
sent to the entry, with the argument $x$ in a standard position (the
accumulator, for instance), then the operation is carried out, finishing with
$\sqrt{x}$ in the accumulator and the control at the exit. In the present
circumstances we can direct the entry and set up the exit, once for all, in the
main program.

However, if we want to use the same subroutine several times, in 
different places on the main program, more elaborate arrangements are 
necessary. These depend greatly on the facilities of the machine. We 
discuss an unsophisticated case first. For clarity, consider a square-root 
subroutine which produces $\sqrt{x}$ in $\beta_2$ with the control in $m_2$ if the control
is sent to $m_1$ with $x$ in $\beta_1$. Suppose that we want to use this subroutine in
two places in the program, after instructions $c_1$ and $c_2$. We may then
use the scheme


c_1 + 1    Put x in β_1.
c_1 + 2    Put "Go to c_1 + 4" in the accumulator.
c_1 + 3    Go to m_1.


c_2 + 1    Put x in β_1.
c_2 + 2    Put "Go to c_2 + 4" in the accumulator.
c_2 + 3    Go to m_1.


provided the subroutine begins by setting up its exit in the following way:

Entrance   m_1    Put the instruction in the accumulator in m_2.


Exit       m_2    Go to c_i + 4 (set by m_1).


It is clear that we need to store an instruction of the form "Go to $c_i + 4$"
for each $i$, for use in $c_i + 2$.

A more compact method of handling this problem (at the expense of
equipment, of course) is the use of a "record" instruction. Such an
instruction in cell $c$ has the form


c        Record "Go to c + 2" in m_2.

If this is followed by

c + 1    Go to m_1.


we have accomplished our task.
To show the compactness of this, we indicate the programming for
$x^{1/8}$, with $x$ being given:


 1. Put x in β_1.

 2. Record "Go to 4" in m_2.

 3. Go to m_1.

 4. Put x^{1/2} (now in β_2) in β_1.

 5. Record "Go to 7" in m_2.

 6. Go to m_1.

 7. Put x^{1/4} (now in β_2) in β_1.

 8. Record "Go to 10" in m_2.

 9. Go to m_1.

10. Exit; x^{1/8} is in β_2.
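
The ten steps can be traced with a small simulation. This is a sketch only: the cells `beta1`, `beta2`, and `m2` are modeled as dictionary entries, the program counter as a Python variable, and the square-root subroutine as a single step entered at the label `"m1"`; none of this represents a particular machine.

```python
import math

def run_eighth_root(x):
    """Trace the program: three calls of one square-root subroutine, each
    preceded by a 'record' of the return jump in the exit cell m2."""
    cells = {"beta1": 0.0, "beta2": 0.0, "m2": None}
    pc = 1
    while True:
        if pc == 1:
            cells["beta1"] = x
            pc = 2
        elif pc in (2, 5, 8):          # Record "Go to pc + 2" in m2
            cells["m2"] = pc + 2
            pc += 1
        elif pc in (3, 6, 9):          # Go to m1 (subroutine entrance)
            pc = "m1"
        elif pc == "m1":               # subroutine body, then leave through the exit m2
            cells["beta2"] = math.sqrt(cells["beta1"])
            pc = cells["m2"]
        elif pc in (4, 7):             # move the result back to the argument cell
            cells["beta1"] = cells["beta2"]
            pc += 1
        elif pc == 10:                 # Exit; result is in beta2
            return cells["beta2"]

if __name__ == "__main__":
    print(run_eighth_root(256.0))
```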




We conclude this section by returning to the consideration of the re- 
location or incorporation of subroutines. We have indicated how this 
can be done by a special program. In some computers equipment is
available to do this. The subroutine always stays in the storage with 
fictitious addresses, but those instructions which require a constant to be 
added to the address are distinguished in some innocuous way—for ex- 
ample, by the use of a negative sign, where regular instructions have a 
positive sign. Then the equipment is so arranged that the assigned con- 
stant is added to the address in the control unit during the decoding of 
the instruction and before its execution. 

This feature can often be combined with the one discussed earlier, 
whereby a change in the sign of the B register triggers a change of path 
in the program, to give an elegant program. We illustrate this by
returning to the memory summation. We use * to indicate B modification.


0. Clear register for partial sums.

1. Set B = Bp - x.

2.*Clear and add (z). 

3. Add partial sum. 

4. Store partial sum. 

5. Decrease B. If B goes negative, go to 6; otherwise to 2. 
6. Stop. 


4.22 Ordinary Differential Equation 


We now take up problem 1 of the preceding section. We consider
using the Heun method for $y' = f(x,y)$. This consists in determining
$y_n \approx y(x_n)$, $x_n = nh$, for $n = 0, 1, 2, \ldots$, by

$$y_{n+1} = \tfrac{1}{2}(y_{n+1}^{*} + y_{n+1}^{**}),$$

where $y_{n+1}^{*} = y_n + h f(x_n, y_n)$, $y_{n+1}^{**} = y_n + h f(x_{n+1}, y_{n+1}^{*})$. Assuming the
validity of Taylor expansions, it is easy to see that the local error is $O(h^3)$ and
that the total error (assuming no magnification) is $O(h^2)$, since the
number of steps is $O(h^{-1})$. In the case under consideration, an expansion in
power series at the origin is not valid, and the errors will be larger. An interval
$h = .001$ seems appropriate if we are to have results correct to about 6D.
Rough calculations indicate that $y$ will exceed unity somewhat, and in
order to avoid overflows, in a fixed-point machine, using numbers in the
range $(-1,1)$, a scaling by a factor of $\tfrac14$ of each variable seems
convenient. We therefore consider the solution of


$$y' = \tfrac{1}{2}\sqrt{x} + \tfrac{1}{2}\sqrt{y}, \qquad y(0) = 0$$

for $x = 0(h).25$ (see Richter [10]). It is instructive to keep $h$ variable
and to arrange to print out, not after every step, but only at an interval
of .0125 in $x$. The details of the program are omitted.
The results of the calculations on a UNIVAC for $h = .0125$, $.0025$,
$.00125$, and $.00025$ are shown in the table below.
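
Since the details of the program are omitted, a short sketch may help. It assumes the scaled equation $y' = \tfrac12\sqrt{x} + \tfrac12\sqrt{y}$, $y(0) = 0$, and the Heun formulas as read above; the function names are hypothetical, and ordinary floating-point arithmetic is used in place of the fixed-point arithmetic of the original calculation.

```python
import math

def f(x, y):
    """Right-hand side of the scaled equation (assumed form)."""
    return 0.5 * math.sqrt(x) + 0.5 * math.sqrt(y)

def heun(h, x_end, print_interval=0.0125):
    """Integrate from x = 0, y = 0 to x_end with step h; return the pairs
    (x, y) at multiples of print_interval."""
    steps = round(x_end / h)
    every = max(1, round(print_interval / h))
    y = 0.0
    table = [(0.0, y)]
    for n in range(steps):
        x0, x1 = n * h, (n + 1) * h
        y_star = y + h * f(x0, y)
        y_star2 = y + h * f(x1, y_star)
        y = 0.5 * (y_star + y_star2)
        if (n + 1) % every == 0:
            table.append((x1, y))
    return table

if __name__ == "__main__":
    for x, y in heun(0.0125, 0.25):
        print(f"{x:.4f}  {y:.11f}")
```

With $h = .0125$ the first two printed values can be checked by hand against the last column of the table.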


4.23 Rectangular-Polar Conversion Table 


This is to be regarded as an exercise in using the output equipment. 
The task is to produce a table which, initially, is a list of the form 


00002 00001 0022360680 4636476090 
00003 00001 0031622777 3217505544 
00003 00002 0036055513 5880026035 
00004 00001 0041231056 2449786631 


00100 00099 1407160261 7803730801

Results for the differential equation of Sec. 4.22 (values of $y$, with the
decimal point understood before the first digit):

  x      h = .00025     h = .00125     h = .0025      h = .0125
0        00000000000    00000000000    00000000000    00000000000
.0125    00054793700    00054084766    00052730731    00034938562
.0250    00159912174    00159108463    00157549032    00135795112
.0375    00299888889    00299020417    00297323965    00273099005
.0500    00469064935    00468145066    00466341119    00440237794
.0625    00664121481    00663158054    00661263636    00633609674
.0750    00882794453    00881792703    00879819092    00850825638
.0875    01123406696    01122370425    01120325726    01090141626
.1000    01384648727    01383580841    01381471217    01350207833
.1125    01665459906    01664362708    01662193028    01629937288
.1250    01964957501    01963832876    01961607102    01928429044
.1375    02282391333    02281240853    02278962286    02244919744
.1500    02617113183    02615938188    02613609625    02578751306
.1625    02968555308    02967356957    02964980810    02929348394
.1750    03336214844    03334994148    03332572516    03296202153
.1875    03719642145    03718399993    03715934729    03678858123
.2000    04118431874    04117169062    04114661815    04076907044
.2125    04532216083    04530933327    04528385574    04489977709
.2250    04960658706    04959356649    04956769724    04917731319
.2375    05403451139    05402130364    05399505479    05359856951
.2500    05860308623    05858969672    05856307932    05816067873

Then this should be edited by the suppression of initial zeros, insertion or
suppression of decimal points, labeling of pages, separation of entries into 
groups of five, spacing between the groups of five, and double-spacing 
after completing a value of m. The final arrangement should be com- 
parable with the tables of Neville [31] or Todd [30]. The effort required 
to produce it will depend on the equipment available. 
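
The two stages, raw list and edited table, might be sketched as follows. The field widths of the raw list are inferred from the sample lines shown above (7 decimals for $r$, 10 for $\theta$); the edited layout, the function names, and the grouping helper are assumptions made for illustration.

```python
import math

def raw_entry(x, y):
    """One line of the unedited list: x, y, r, and theta as digit strings."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x)
    return f"{x:05d} {y:05d} {round(r * 10**7):010d} {round(theta * 10**10):010d}"

def edited_entry(x, y):
    """The same entry with initial zeros suppressed and decimal points inserted."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x)
    return f"{x:3d} {y:3d} {r:12.7f} {theta:13.10f}"

def in_groups_of_five(lines):
    """Insert a blank line after every fifth entry."""
    out = []
    for i, line in enumerate(lines, start=1):
        out.append(line)
        if i % 5 == 0:
            out.append("")
    return out

if __name__ == "__main__":
    print(raw_entry(2, 1))
    print("\n".join(in_groups_of_five([edited_entry(12, y) for y in range(1, 12)])))
```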


4.24 Sievert’s Integral 
Consider the evaluation of 


$$S(x,y) = \int_0^x \exp(-y \sec t)\, dt$$


for a single argument pair. We choose $x = .5$, $y = .5$ since in this case
no scaling difficulties are encountered. A rough calculation shows that
using Simpson's rule with $h = .01$ will give an accuracy of about 10D.
We suggest three methods.

(a) If we assume that we have available subroutines for the evaluation
of exp $x$, sec $x$ and for the application of Simpson's rule, the problem is
an easy exercise in the combination of subroutines. [The Simpson's rule
subroutine would have the following structure: $n$, $\alpha$, $\beta$ would be
parameters which required setting, and there would be an exit to the $f(x)$
subroutine from which the control would return with $f(x)$ in an assigned
position.]

(b) We observe that it will probably be quicker to generate the secants
from the cosines using the addition formulas


$$\cos t_{n+1} = \cos t_n \cos h - \sin t_n \sin h,$$
$$\sin t_{n+1} = \sin t_n \cos h + \cos t_n \sin h$$

than by an independent evaluation of the cosine from the power series
for each $t_n$.

(c) Various orthogonal quadratures can be used (see Chaps. 2 and 3).

The value obtained by the second method was


$$S(.5,.5) = .29665\ 75005;$$

the value obtained by interpolation in a standard table [9] is


$$S(.5,.5) = .29665\ 7503.$$
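
Method (b) can be sketched as follows, assuming the integral runs from $0$ to $x$ with integrand $\exp(-y \sec t)$; the function name and the rounding of the number of intervals to an even integer are choices made for this illustration.

```python
import math

def sievert(x, y, h=0.01):
    """Simpson's rule for the integral of exp(-y sec t) over 0 <= t <= x,
    generating cos t and sin t by the addition formulas."""
    n = round(x / h)
    if n % 2:
        n += 1                      # Simpson's rule needs an even number of intervals
    h = x / n
    cos_h, sin_h = math.cos(h), math.sin(h)
    c, s = 1.0, 0.0                 # cos 0, sin 0
    total = math.exp(-y / c)
    for j in range(1, n + 1):
        c, s = c * cos_h - s * sin_h, s * cos_h + c * sin_h
        weight = 1 if j == n else (4 if j % 2 else 2)
        total += weight * math.exp(-y / c)
    return total * h / 3.0

if __name__ == "__main__":
    print(sievert(0.5, 0.5))
```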


4.25 Matrix and Vector Operations 


The construction of codes for the basic matrix problems of inversion 
and decomposition is considered beyond the scope of an elementary 
course. We enumerate here a selection of auxiliary codes which are use- 
ful and the construction of which is instructive and not too involved. 


(a) The Component of Maximum Modulus of a Vector


We note that in some machines the determination of the component 
of maximum modulus of a vector has been incorporated as a basic in- 
struction. 


(b) Normalization of a Vector 


The normalization of a vector involves the replacement of
$V = (v_1, v_2, \ldots, v_n)$ by $U = (u_1, u_2, \ldots, u_n)$ where $U = kV$, $k$ being a
constant such that $\sum u_i^2 = |U|^2$ has an assigned value.

Let us discuss scaling in this problem. Suppose that we assume that
always $n \le 99$. Then, if we assume $|v_i| < .1$, we have $\sum v_i^2 < 1$. If we
take

$$u_i = v_i / \bigl(10 (\textstyle\sum v_i^2)^{1/2}\bigr),$$

we observe that $|U| = .1$ and each $|u_i| \le .1$. No overflows will take
place.

(c) Scalar Product of Two Vectors


It is often desirable to have the subroutine for the scalar product of two 
vectors written so that the partial sums are kept in double precision and 
the final result is presented as a single precision number. 


(d) Various Norms of Matrices

Consider, in particular, the computation of

$$M_1 = \max_{i,j} |a_{ij}|, \qquad M_2 = n^{-2} \sum_{i,j} |a_{ij}|, \qquad M_3 = n^{-1} \Bigl(\sum_{i,j} a_{ij}^2\Bigr)^{1/2}.$$

(e) Generation of Special Matrices

For instance, consider the storage by rows of the $n \times n$ matrix $A = (a_{ij})$,
where $a_{ij} = a_{ji} = i/j$ if $i \le j$, in storage beginning at $m$. That is, we want
to put

$a_{11}, a_{12}, \ldots, a_{1n}$ in $m, m+1, \ldots, m+n-1$,
$a_{21}, a_{22}, \ldots, a_{2n}$ in $m+n, m+n+1, \ldots, m+2n-1$,

and so on.

Other examples of special matrices are given by Newman and Todd [2]
(see also Chap. 6 and Marcus [42]).
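
Items (d) and (e) can be illustrated together. This is a sketch under assumptions: the store is modeled as a flat Python list with the matrix held by rows, the norms are taken in the normalized forms (maximum, mean modulus, root mean square) as read above, and the function names are hypothetical.

```python
def special_matrix(n):
    """The matrix a_ij = a_ji = i/j (i <= j), stored by rows in one list."""
    store = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            store.append(min(i, j) / max(i, j))
    return store

def norms(store, n):
    """M_1 = max |a_ij|, M_2 = n^-2 sum |a_ij|, M_3 = n^-1 (sum a_ij^2)^(1/2)."""
    m1 = max(abs(a) for a in store)
    m2 = sum(abs(a) for a in store) / n ** 2
    m3 = sum(a * a for a in store) ** 0.5 / n
    return m1, m2, m3

if __name__ == "__main__":
    A = special_matrix(3)
    print(A)
    print(norms(A, 3))
```

Element $a_{ij}$ is found at offset $(i-1)n + (j-1)$ from the start of the list.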


(f) Matrix Multiplication 


If we disregard all questions of overflow, matrix multiplication involves
a straightforward triple induction. We give in Fig. 4.3 a flow diagram
for one arrangement in which the product $C = AB$ is computed by rows
and the rows are printed out one at a time. We assume that each matrix
is an $n \times n$ one.


(g) Evaluation of Efficiency of Inversion Programs 

We assume given a program for the inversion of a matrix, and we want
to apply it to special matrices and to observe norms of inversion errors.
This involves the use of (e) above in generating the matrix, then the
inversion program under examination, and then (f) to compute error
matrices

$$XA - I, \qquad AX - I, \qquad \bar{X} - A,$$

where $X$ is the reputed inverse of $A$ and $\bar{X}$ that of $X$. If $A^{-1}$ is known
explicitly, we can also compute

$$X - A^{-1}.$$

Finally, we have to apply (d) to the error matrices. Typical results are
given by Newman and Todd [2] (see also Chap. 6).

[Flow diagram; recoverable box labels: "Advance k", "Modify orders, e.g. *", "Advance j", "Modify orders, e.g. **".]

Fig. 4.3. Matrix multiplication program.


(h) Determinations of Dominant Eigenvalue and Eigenvector
by the Power Method


We outline a simple method which is often quite efficient. Suppose
$A$ is an $n \times n$ matrix with eigenvalues $\alpha_1, \alpha_2, \ldots, \alpha_n$, where
$|\alpha_1| > |\alpha_2| \ge |\alpha_3| \ge \cdots \ge |\alpha_n|$. For simplicity, suppose that the
eigenvectors $c_1, c_2, \ldots, c_n$ span the whole space. Take a vector $v$ and
represent it in terms of the $c_i$:

$$v = \sum_i a_i c_i.$$

Since $A c_i = \alpha_i c_i$ for $i = 1, 2, \ldots, n$, we have for any integer $r$, and for
$i = 1, 2, \ldots, n$,

$$A^r c_i = \alpha_i^r c_i.$$

Hence, provided $a_1 \ne 0$, we have

$$v^{(r)} = A^r v = \sum_i a_i \alpha_i^r c_i \approx a_1 \alpha_1^r c_1,$$

provided $r$ is large enough, since our assumption that $\alpha_1$ is the dominant
eigenvalue means that the succeeding terms become negligible. Thus
we have shown that $v^{(r)}$ is approximately a multiple of $c_1$ and that the
ratio of the components in $v^{(r+1)}$ to those of $v^{(r)}$ is approximately $\alpha_1$.

This process is applied as follows. An arbitrary normalized vector $v^{(0)}$
is chosen. Then normalized vectors $v^{(r)}$ and constants $\mu^{(r)}$ are chosen so
that

$$A v^{(r)} = \mu^{(r+1)} v^{(r+1)}.$$

Then $\mu^{(r)}$ will tend to $\alpha_1$ and $v^{(r)}$ to the normalized $c_1$. The
normalization may be chosen in the sense of (b) above, or a fixed component of the
$v^{(r)}$ may be forced to assume a fixed value. The condition $a_1 \ne 0$, that
is, that the initial vector should not be orthogonal to the dominant
eigenvector $c_1$, is not a critical one; for the orthogonality will probably be
destroyed by round-off, and the dominant vector will be obtained after
some delay.

The following example, which presents no difficulties, is discussed by 
Taussky and Todd [4]: 


s2 9 1.32 
A = { —11.2 22.28 —10.72 ], 


—35.8 9.45 —1.94 


with $v^{(0)} = (1,0,0)$. The following example has been discussed by
Bodewig [7] and Wilkinson [5] and is troublesome:


        (  2    1    3    4 )
    A = (  1   −3    1    5 )
        (  3    1    6   −2 )
        (  4    5   −2   −1 )


Various refinements of this method have been discussed—in particular, 
the use of acceleration schemes such as the Aitken $\delta^2$ method (see, e.g.,
Wilkinson [6] and Osborne [24]). 
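
The basic iteration can be sketched as follows. The normalization used here, dividing by the component of largest modulus so that this component becomes 1, is one of several possibilities and is an assumption of the sketch, as are the function name and the fixed number of iterations; no acceleration is attempted.

```python
def power_method(A, v, iterations=100):
    """Return (mu, v) after repeated multiplication by A, normalizing so
    that the component of largest modulus equals 1."""
    n = len(A)
    mu = 0.0
    for _ in range(iterations):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        mu = max(w, key=abs)          # component of maximum modulus, with its sign
        if mu == 0.0:
            raise ValueError("A v vanished; choose another starting vector")
        v = [component / mu for component in w]
    return mu, v

if __name__ == "__main__":
    print(power_method([[2.0, 1.0], [1.0, 2.0]], [1.0, 0.0]))
```

For a matrix whose two largest eigenvalues are close in modulus, such as the second example above, many more iterations are needed than for the small matrix in the usage line.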


4.26 Floating, Double-precision, and Complex Arithmetic 


In many problems it is found that a straightforward use of the machine 
arithmetic is insufficient. The quantities involved may have a large 
range, be required to higher precision, or be complex. In order to cope 
with such problems it is essential to have subroutines to carry out the 
arithmetic operations in these cases. Although these are, in principle, 
relatively easy to construct, they are slow in operation, and pressure has 
been brought on the machine designers to provide, in particular, ma- 
chines with floating-point arithmetic. The convenience of such equip- 
ment is notable, but it is dangerous in uncritical hands. Perhaps the 
proper use of the added facilities is in exploratory calculations which will 
indicate when and where scaling will be needed in the main, faster fixed-
point program.

(a) Floating-point Arithmetic


The usual arrangement on decimal machines with, say, a sign and 10
digits is the following. We consider numbers $A \ne 0$ which can be
represented in the form $\pm 10^{\beta} \cdot a$, where $-50 \le \beta \le 49$ and where
$.1 \le a < 1$ and where $a$ is represented by an 8-decimal-digit number. Then
$\alpha = \beta + 50$ satisfies $0 \le \alpha \le 99$, and $A$ is represented by its sign, by the
2 digits of $\alpha$, and by the 8 digits of $a$. Certain conventions about the
handling of zero must be introduced.

We have to develop an arithmetic for the ordered pairs $(\alpha,a)$; the
unpacking of these is accomplished by extraction or shift operations. Let
us discuss the addition of $A_1 = (\alpha_1,a_1)$ and $A_2 = (\alpha_2,a_2)$. Suppose that
the numbers are labeled so that $\alpha_1 - \alpha_2 = \delta \ge 0$.

1. If $\delta > 8$ (in our case), then $A_2$ is negligible.

2. If $\delta = 0$, we can add $a_1 + a_2 = a$; and if there is no overflow, then
$A_1 + A_2 = (\alpha_1,a)$. If there is overflow, we have to round the last digit
but one, shift right one place, and insert a unit in the first place to get $a$.
(Observe that, since the sign of $a_1 + a_2$ is not known, the rounding and
the adjustment of the first digit must be done with care.) Then we have
to advance the exponent and combine the $\alpha$, $a$ in one word. There is
one further point to be considered: it is possible that $a_1 + a_2$ may not be
in normal form. (An extreme case would be adding $A$ and $-A$.) We
have therefore to determine whether left shifts and adjustments of the
exponent are necessary and, if so, to make them. (We note that the
terminal zeros introduced here are not significant.)

3. If $1 \le \delta \le 8$, we have to shift $a_2$ to the right $\delta$ places before
carrying out the addition, rounding off appropriately. We are then
essentially in the case already discussed.

The schemes for multiplication and division are somewhat less
complicated.
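
The three cases can be sketched with integer arithmetic. The representation is an assumption made for illustration: a number is a triple `(sign, alpha, mant)` with value `sign * 10**(alpha - 50) * mant / 10**8`, zero is represented by the convention `(1, 0, 0)`, and an exponent outside 0 to 99 is reported as an error rather than handled.

```python
DIGITS = 8
UNIT = 10 ** DIGITS      # the fraction a is mant / UNIT, with UNIT/10 <= mant < UNIT

def fl_add(A1, A2):
    """Add two packed decimal floating-point numbers (sign, alpha, mant)."""
    if A1[2] == 0:
        return A2
    if A2[2] == 0:
        return A1
    if A1[1] < A2[1]:
        A1, A2 = A2, A1                      # label so that alpha_1 >= alpha_2
    (s1, alpha, m1), (s2, alpha2, m2) = A1, A2
    delta = alpha - alpha2
    if delta > DIGITS:
        return A1                            # case 1: A2 is negligible
    if delta > 0:
        m2 = (m2 + 10 ** delta // 2) // 10 ** delta   # case 3: shift right, rounding
    total = s1 * m1 + s2 * m2
    if total == 0:
        return (1, 0, 0)                     # assumed convention for zero
    sign = 1 if total > 0 else -1
    mant = abs(total)
    if mant >= UNIT:                         # overflow: round, shift right, advance exponent
        mant = (mant + 5) // 10
        alpha += 1
    while mant < UNIT // 10:                 # not in normal form: shift left
        mant *= 10
        alpha -= 1
    if not 0 <= alpha <= 99:
        raise OverflowError("exponent outside the range 0 to 99")
    return (sign, alpha, mant)

def to_float(A):
    sign, alpha, mant = A
    return sign * mant * 10.0 ** (alpha - 50 - DIGITS)

if __name__ == "__main__":
    print(fl_add((1, 51, 10000000), (1, 48, 10000000)))   # 1.0 + .001
```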


(b) Double-precision Arithmetic 


Addition in this mode is comparatively simple, but multiplication and 
division are more troublesome. It is recommended that some double- 
length arithmetic on desk machines be carried out before undertaking the 
programming. 

The case of the product $X_1 X_2$ is indicated in the following diagram,
where $X_i = (x_i, y_i)$:


x_1 x_2    |--------|--------|
x_1 y_2             |--------|--------|
x_2 y_1             |--------|--------|
y_1 y_2                      |--------|--------|


The rounding of the less significant part of $y_1 y_2$ will usually be
accomplished by a special rounding instruction or a special multiplication
instruction. We have now to add the less significant parts of $x_1 y_2$, $x_2 y_1$ to
the (now rounded) more significant part of $y_1 y_2$. We observe the
overflow, if any, and then observe whether the sum satisfies $|s| \ge \tfrac12$, in which
case, on rounding, it will contribute to the last digit in the less significant
part of the product $x_1 x_2$; the total contribution of this sum there will be
$0, \pm 1, \pm 2, \pm 3$.

The next step in obtaining the less significant part of $X_1 X_2$ is the
addition of the more significant parts of $x_1 y_2$ and $x_2 y_1$, and the less significant
part of $x_1 x_2$ to the contribution just obtained. Any overflow in this
addition is carried to the more significant part of $x_1 x_2$ to give the more
significant part of $X_1 X_2$.
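
The scheme can be tried out with integers standing for the two halves. The word length (5 decimal digits), the restriction to non-negative numbers, and the function name are assumptions of this sketch; a double-length number $(x, y)$ stands for the fraction $(xW + y)/W^2$.

```python
W = 10 ** 5      # hypothetical single word of 5 decimal digits

def double_length_product(X1, X2):
    """Double-length product of non-negative double-length fractions,
    built from the four single-length partial products."""
    (x1, y1), (x2, y2) = X1, X2
    hi_xx, lo_xx = divmod(x1 * x2, W)
    hi_xy, lo_xy = divmod(x1 * y2, W)
    hi_yx, lo_yx = divmod(x2 * y1, W)
    hi_yy = (y1 * y2 + W // 2) // W           # rounded more significant part of y1 y2
    s = lo_xy + lo_yx + hi_yy                 # everything below the last digit kept
    contribution = (s + W // 2) // W          # 0, 1, 2 or 3 units in the last place
    low = hi_xy + hi_yx + lo_xx + contribution
    overflow, low = divmod(low, W)
    high = hi_xx + overflow
    return high, low

if __name__ == "__main__":
    print(double_length_product((12345, 67890), (98765, 43210)))
```

Because two roundings are made, the result may differ by one unit in the last place from the correctly rounded product.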


(c) Complex Arithmetic 


The handling of complex numbers in the form x + iy or re‘°—where 
the pair of real numbers (x, y) or (7,9) is stored in two cells, or in one—in- 
troduces no new complications. 


(d) Subroutine Structure


The following remarks are relevant when the machine under con- 
sideration has no built-in operations of the kind contemplated. 

In general, all the arithmetic operations will be needed—addition, 
subtraction, multiplication, and division—and it is convenient and eco- 
nomical to have a single subroutine to handle them, if entry to it is made 
at an appropriate point. (It may also be convenient to have appropriate
conversion routines included, e.g., to "float" or to "unfloat" a number,
to convert from cartesian to polar form and inversely.) It may also be 
convenient to place the whole subroutine in a fixed position in the mem- 
ory, so that the addresses for the arguments will be fixed. 


4.27 The Heat Equation $u_{xx} = u_t$


Some important considerations arise in the study of the numerical 
solutions of the partial differential equation u,, =u, We suppose that 
we are asked to solve this for ¢ > 0,0 < x < 1, subject to the boundary 
conditions 


u(x,0) = f(x), f(x) given for 0 <x <1; 
u(O,t) =u(l,t) =0,, allt >0. 


The exact solution can be written down if we can expand f(x) as an ab- 
solutely convergent Fourier series 


1 
f(x) = da, sin inx, a= 2 u(_y,0) sin iy dy. 
0 
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In this case, 
u(x,t) = > a, sin tax exp (—2?27*t). 
A simple discretization of this problem, in which we replace the space 
derivative by h-?[u(x + h, y) — 2u(x, y) + u(x — h,_y)] and the time deriv- 
ative by k-1[u(x, y + k) — u(x, y)], where kh-* = 1, is 


U(m,n +1) =rU(m — 1,n) + (1 — 2r)U(m,n) + 1U(m + 1,0), 
U(m,0) = f(mh), m=0,1,...,M+1, 
U(0,nk) = U(M + I1,n) = 0, tae a eee 


Here U(m,n) is to be thought of as an approximation to u(mA,nk). 

This partial difference equation can be solved explicitly, and it can be
shown that $U \to u$ as $h \to 0$, provided the ratio $kh^{-2} = r$ satisfies
$r \le 1/2$. This question is fully discussed in Chap. 11 (see also
Richtmyer [29]).

Let us discuss the case in which $f(x) = 2x$, $0 \le x \le 1/2$, $f(x) = 2(1 - x)$,
$1/2 \le x \le 1$. Consider the evaluation of $u(1/2, 3/64)$ and various
approximations $U_M(1/2, 3/64)$. This example was devised for a binary
computer, but a little thought will show that it is equally suitable for a
decimal one.

We find

$$u(x,t) = 8\pi^{-2}\left[\sin \pi x \, \exp(-\pi^2 t) - \tfrac{1}{9} \sin 3\pi x \, \exp(-9\pi^2 t) + \tfrac{1}{25} \sin 5\pi x \, \exp(-25\pi^2 t) - \cdots\right],$$

which gives

$$u(1/2, 3/64) = .51175\ 20442.$$


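We can obtain $U(m,n)$ explicitly as

$$U(m,n) = \sum_{i=1}^{M} c_i \sin 2mi\theta \, (1 - 4r \sin^2 i\theta)^n,$$

where

$$\theta = \frac{\pi}{2(M+1)} \quad \text{and} \quad c_i = \frac{2}{M+1} \sum_{s=1}^{M} U(s,0) \sin 2si\theta.$$

In the present case we find

$$c_i = 0, \quad i \text{ even}; \qquad c_i = 2(-1)^{(i-1)/2}(M + 1)^{-2} \operatorname{cosec}^2 i\theta, \quad i \text{ odd}.$$

We now fix $r = 1/4$ and obtain, where the summation is over odd $i$,
$1 \le i \le M$,

$$U(m,n) = 2(M + 1)^{-2} \sum (-1)^{(i-1)/2} \operatorname{cosec}^2 i\theta \, \cos^{2n} i\theta \, \sin 2mi\theta.$$

If we now take $M$ odd and put $m = \tfrac{1}{2}(M + 1)$, so as to observe only
the central point, we find

$$U(\tfrac{1}{2}(M + 1), n) = 2(M + 1)^{-2} \sum \operatorname{cosec}^2 i\theta \, \cos^{2n} i\theta. \qquad (4.2)$$
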
It is trivial to find directly for $M = 3$, $n = 3$ that

$$U_3(1/2, 3/64) = 17/32 = .53125.$$

An exact calculation by hand, or by machine, for $M = 7$, $n = 12$ gives

$$U_7(1/2, 3/64) = .51661\ 72981.$$

A hand calculation, using (4.2), gives for $M = 15$, $n = 48$,

$$U_{15}(1/2, 3/64) = .51296\ 84502,$$

whereas for $M = 31$, $n = 192$, we find

$$U_{31}(1/2, 3/64) = .51205\ 61854.$$

The last 3 digits in $U_{15}$ and $U_{31}$ are doubtful. The convergence to

$$u(1/2, 3/64) = U_\infty(1/2, 3/64) = .51175\ 20442$$

is evident. If we round these values to five decimals and take the
differences $U_M - u$, we find

1950, 487, 122, 31,

indicating that the error is $O(h^2)$.
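
The figures above can be checked with a short program. The following is a minimal sketch in Python, not the program used for the values quoted; the function names are our own, and we assume $r = 1/4$ and $h = 1/(M + 1)$, so that $n$ steps reach $t = nrh^2$. It marches the partial difference equation directly, evaluates the closed form for the central point, and sums the Fourier series for comparison.

```python
import math

def f(x):
    """Initial values: the 'tent' function f(x) = 2x or 2(1 - x)."""
    return 2 * x if x <= 0.5 else 2 * (1 - x)

def explicit_center(M, n, r=0.25):
    """March U(m, n+1) = r U(m-1, n) + (1 - 2r) U(m, n) + r U(m+1, n)
    for n steps with h = 1/(M+1); return the value at the central point."""
    h = 1.0 / (M + 1)
    U = [f(m * h) for m in range(M + 2)]
    U[0] = U[M + 1] = 0.0
    for _ in range(n):
        interior = [r * U[m - 1] + (1 - 2 * r) * U[m] + r * U[m + 1]
                    for m in range(1, M + 1)]
        U = [0.0] + interior + [0.0]
    return U[(M + 1) // 2]

def closed_form_center(M, n):
    """Central value from the closed form, valid for r = 1/4 and M odd."""
    theta = math.pi / (2 * (M + 1))
    total = sum(math.cos(i * theta) ** (2 * n) / math.sin(i * theta) ** 2
                for i in range(1, M + 1, 2))
    return 2.0 * total / (M + 1) ** 2

def exact_center(t, terms=50):
    """Fourier series for u(1/2, t), summed over the first odd indices."""
    return 8 / math.pi ** 2 * sum(math.exp(-(i * math.pi) ** 2 * t) / i ** 2
                                  for i in range(1, 2 * terms, 2))

if __name__ == "__main__":
    u = exact_center(3 / 64)
    for M, n in [(3, 3), (7, 12), (15, 48), (31, 192)]:
        U = explicit_center(M, n)
        print(M, n, U, closed_form_center(M, n), U - u)
```

Each halving of $h$ should reduce the printed difference by a factor of about 4, which is the $O(h^2)$ behavior noted above.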

The programming of the solution of the partial difference equation is
trivial. The results given above should be reproduced if we choose any
$r \le 1/2$. However, if $r > 1/2$, say $r = 3/4$, the phenomenon of numerical
instability appears: if some rounding error is introduced, it gets
amplified exponentially, and no sensible results are obtained. To make
sure that a disturbance is introduced, it may be desirable to use, for
example, $(\pi/64) f(x)$ in place of $f(x)$: the factor $\pi$ ensures that rounding
will take place, and the factor $1/64$ gives room to show the large
oscillations which develop (see, e.g., Richtmyer [29], pp. 6-9).
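
The contrast between the two regimes is easily observed. The sketch below is an illustration under our own assumptions (Python floating point, $M = 15$, 100 time steps, initial data scaled by $\pi/64$); the function names are hypothetical. For $r = 1/4$ each new value is a weighted mean of its neighbors, so the largest magnitude cannot grow. For $r = 3/4$ the factor $1 - 4r \sin^2 i\theta$ exceeds 1 in magnitude for the highest-frequency components, and these, together with any rounding disturbance, grow geometrically.

```python
import math

def tent(x):
    """f(x) = 2x on [0, 1/2], 2(1 - x) on [1/2, 1]."""
    return 2 * x if x <= 0.5 else 2 * (1 - x)

def max_amplitude(M, steps, r, scale=math.pi / 64):
    """Run the explicit scheme from scale * f(x) and return max |U|."""
    h = 1.0 / (M + 1)
    U = [scale * tent(m * h) for m in range(M + 2)]
    U[0] = U[M + 1] = 0.0
    for _ in range(steps):
        interior = [r * U[m - 1] + (1 - 2 * r) * U[m] + r * U[m + 1]
                    for m in range(1, M + 1)]
        U = [0.0] + interior + [0.0]
    return max(abs(v) for v in U)

if __name__ == "__main__":
    print("r = 1/4:", max_amplitude(15, 100, 0.25))   # stays below pi/64
    print("r = 3/4:", max_amplitude(15, 100, 0.75))   # grows enormously
```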


4.28 Number Theoretical Problems 


In number theoretical problems we have to imagine the decimal (or
preferably, binary) point placed at the extreme right of the word, rather
than at the extreme left.

(a) A basic subroutine is one for the factorization of a number $N$.
This can be constructed by trying all possible divisors $d \le N^{1/2}$. A
simpler version would be to try 2 and then all odd numbers. A more
sophisticated one would be to use a sieve technique, for example, trying
the numbers congruent to 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 (mod 30).
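
As an illustration, a factorization subroutine of the sieve kind might be sketched as follows. This is one possible reading of the technique, not a transcription of any particular program: we assume that 2, 3, and 5 are tried first, and that thereafter only the trial divisors 7, 11, 13, 17, 19, 23, 29, 31 and the numbers congruent to them (mod 30) are used. The name `factorize` is our own.

```python
def factorize(n):
    """Return the prime factors of n (with multiplicity) by trial division,
    using only trial divisors d with d * d <= n."""
    if n < 2:
        raise ValueError("n must be at least 2")
    factors = []
    for p in (2, 3, 5):
        while n % p == 0:
            factors.append(p)
            n //= p
    # Gaps between successive residues 7, 11, 13, 17, 19, 23, 29, 31, 37, ...
    gaps = (4, 2, 4, 2, 4, 6, 2, 6)
    d, k = 7, 0
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += gaps[k]
        k = (k + 1) % 8
    if n > 1:                     # what remains has no divisor <= its square root
        factors.append(n)
    return factors

if __name__ == "__main__":
    print(factorize(360))         # [2, 2, 2, 3, 3, 5]
    print(factorize(1001))        # [7, 11, 13]
```

Some trial divisors (49, 77, ...) are composite, but they can never divide what remains, since their prime factors have already been removed.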

It is, of course, only necessary to try prime numbers as divisors, but the
storage of a table of primes in the memory might not be convenient. We
mention here an elegant device used by E. Lehmer [20] for a compact
storage of primes on a binary machine: a zero or one is stored in
consecutive positions of a word, according as the corresponding odd number
is prime or composite. Thus

0001 0010 0101 1001 1010 0101 1011 0011 0

corresponds to the odd numbers 3, 5, 7, 9, ....
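
The packed table is easily generated mechanically. The following sketch (the function names are our own, and the primality test is the naive one) produces the digits for the odd numbers 3, 5, 7, ..., writing 0 for a prime and 1 for a composite number.

```python
def is_prime(n):
    """Naive primality test by trial division."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def prime_bitmap(count, first=3):
    """Digit k is '0' if the odd number first + 2k is prime, '1' if composite."""
    return "".join("0" if is_prime(first + 2 * k) else "1" for k in range(count))

def grouped(bits, size=4):
    """Insert a space after every `size` digits for readability."""
    return " ".join(bits[i:i + size] for i in range(0, len(bits), size))

if __name__ == "__main__":
    print(grouped(prime_bitmap(33)))
    # 0001 0010 0101 1001 1010 0101 1011 0011 0
```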

(b) Another fundamental subroutine is one for the euclidean algorithm,
that is, finding the highest common factor of two integers or solving
linear diophantine equations of the form $ax + by = d$.
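
A minimal sketch of such a subroutine follows. It carries the multipliers $x$, $y$ along with the remainders, so that the highest common factor $g$ is obtained together with a solution of $ax + by = g$; the names `extended_gcd` and `solve_linear_diophantine` are ours.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g, where g >= 0 is the highest
    common factor of a and b."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    if old_r < 0:                       # normalize the sign of g
        old_r, old_x, old_y = -old_r, -old_x, -old_y
    return old_r, old_x, old_y

def solve_linear_diophantine(a, b, d):
    """Return one integer solution (x, y) of a*x + b*y == d, or None."""
    g, x, y = extended_gcd(a, b)
    if g == 0:
        return (0, 0) if d == 0 else None
    if d % g != 0:                      # solvable only if g divides d
        return None
    k = d // g
    return x * k, y * k

if __name__ == "__main__":
    print(extended_gcd(240, 46))
    print(solve_linear_diophantine(7, 5, 1))
```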

Various applications of this are possible. It is convenient to have the 
early terms of various power series related to a given one—its various 
powers, its reciprocal, the inverse function, the logarithm, etc. It is 
desirable to have the new coefficients exactly (as rational numbers), and 
since complicated manipulations with power series are fraught with error, 
it is natural to think of doing these manipulations mechanically. Pro- 
grams for this have been constructed by Henrici [25,27], and in them 
the euclidean algorithm in some form is essential, so that rational 
coefficients can be reduced to their lowest terms. 

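A euclidean algorithm can be used to exhibit any unimodular 2 x 2
matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ with integral elements, $ad - bc = 1$, as a
product of the two generators

$$S = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad T = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}.$$

For instance,

$$\begin{pmatrix} 1 & 0 \\ 2 & 1 \end{pmatrix} = STS^2TS.$$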
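
The decomposition can be carried out mechanically by applying the euclidean algorithm to the first column of the matrix. The sketch below is one way of doing this, under assumptions of our own: the generators are taken to be $S = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ and $T = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$, negative powers of $S$ are allowed in the result, and the function names are hypothetical.

```python
def mat_mul(X, Y):
    """Product of two 2 x 2 matrices given as ((a, b), (c, d))."""
    (a, b), (c, d) = X
    (e, f), (g, h) = Y
    return ((a * e + b * g, a * f + b * h), (c * e + d * g, c * f + d * h))

def S_power(q):
    """S = ((1, 1), (0, 1)), so the q-th power of S is ((1, q), (0, 1))."""
    return ((1, q), (0, 1))

T = ((0, -1), (1, 0))

def decompose(M):
    """Write an integer matrix of determinant 1 as a product of powers of
    S and T. Returns the factors ('S', q) or ('T', k) from left to right."""
    (a, b), (c, d) = M
    if a * d - b * c != 1:
        raise ValueError("determinant must be 1")
    word = []
    while c != 0:
        q = a // c                       # euclidean step: |a - q*c| < |c|
        word.append(("S", q))            # M = S^q * M1
        a, b = a - q * c, b - q * d
        word.append(("T", 1))            # M1 = T * M2, M2 = T^(-1) * M1
        a, b, c, d = c, d, -a, -b
    if a == 1:                           # M = ((1, b), (0, 1)) = S^b
        word.append(("S", b))
    else:                                # M = -S^(-b), and T * T = -I
        word.append(("T", 2))
        word.append(("S", -b))
    return word

def evaluate(word):
    """Multiply out a list of factors produced by decompose."""
    P = ((1, 0), (0, 1))
    for name, power in word:
        if name == "S":
            P = mat_mul(P, S_power(power))
        else:
            for _ in range(power):
                P = mat_mul(P, T)
    return P

if __name__ == "__main__":
    M = ((1, 0), (2, 1))
    print(decompose(M), evaluate(decompose(M)) == M)
```

The word produced is not unique and need not be the shortest; multiplying it out again with `evaluate` provides the check.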


If the algorithm subroutine is written for complex integers, then it is
possible to carry out the corresponding decomposition for matrices
$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, where $a$, $b$, $c$, $d$ are complex integers with
$ad - bc = 1$. Any such matrix
can be expressed as a product of powers of 


oa (a dl Go a) 


(c) The theory of quadratic forms is a fertile source of problems. 

(d) There are many other problems in this area, some of which place 
great demands on the computing equipment (see, e.g., Chaps. 15, 16; 
E. Lehmer [20]; Taussky [22]; Tompkins [23]; D. H. Lehmer [41]; 
Taussky and Todd [48]). 


4.29 Game Theory and Linear Programming


The basic idea of a two-person zero-sum game and the Brown- 
Robinson method for its solution have been described in Sec. 1.7. 
This is easily programmed. 
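
A minimal sketch of such a program is given below. It follows the method as it is commonly described, in which each player in turn chooses the pure strategy that is best against the accumulated past choices of the opponent; the details (first move, tie-breaking, the name `fictitious_play`) are our own assumptions. The row player is taken to maximize. At every stage the accumulated totals, divided by the number of plays, give a lower and an upper bound for the value of the game.

```python
def fictitious_play(A, iterations=5000):
    """Approximate the value of the zero-sum game with payoff matrix A
    (row player maximizes). Returns (lower, upper) bounds on the value."""
    rows, cols = len(A), len(A[0])
    row_gain = [0.0] * rows     # total payoff of each row against column's plays
    col_loss = [0.0] * cols     # total payoff of each column against row's plays
    i = 0                       # arbitrary first move by the row player
    lower, upper = float("-inf"), float("inf")
    for t in range(1, iterations + 1):
        for j in range(cols):
            col_loss[j] += A[i][j]
        j = min(range(cols), key=lambda c: col_loss[c])   # column's best reply
        for k in range(rows):
            row_gain[k] += A[k][j]
        i = max(range(rows), key=lambda r: row_gain[r])   # row's best reply
        lower = max(lower, min(col_loss) / t)
        upper = min(upper, max(row_gain) / t)
    return lower, upper

if __name__ == "__main__":
    # A 2 x 2 game whose value is 1/5.
    print(fictitious_play([[2, -1], [-1, 1]]))
```

The convergence is slow, so the method is better suited to exhibiting the idea than to producing many figures.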

The most satisfactory solution of linear-programming problems is often
by the simplex method. The programming of this method, which is
essentially equivalent to the Gaussian elimination method for the solution
of a system of linear equations, or the inversion of a matrix, is not a task
for a novice. Nevertheless, the use of any simplex-method programs on
special cases is a valuable exercise. Some results have been given by
Hoffman, Mannos, Sokolowsky, and Wiegmann [28].


4.30 Service, Checking, and Engineering Subroutines 


(a) Anyone who does a lot of programming, and any efficient organization,
will see the desirability of a battery of service routines to do certain
recurring, more or less clerical jobs. For instance, it is convenient to
have a subroutine which examines two tapes $T_1$, $T_2$ containing supposedly
identical programs word by word and prepares a third tape $T_3$ identical
with $T_1$ and $T_2$ should these coincide. If at any stage the words in $T_1$
and $T_2$ differ, then both are printed out, and the machine does not
proceed until the operator has decided which, if either, is correct. An even
simpler subroutine is one which copies an assigned part of one tape onto
another or which both copies and inserts specified corrections. The
details of these service subroutines depend greatly on the equipment, and
we shall not discuss them further.

(b) The process of checking a program is one which deserves a
thorough study. It is clear that many isolated parts of the program,
essentially “subroutines,” should be checked by themselves. Then the
checking of the major program is reduced essentially to that of logic, the
interconnections of the subprograms.

Many schemes have been devised to use the computer itself to help in
this process. If, for example, the machine has special actions on overflow,
it may be desirable to list all instructions in the program which can
cause overflow and then to check visually that the appropriate actions
are arranged. Such a listing is easy to obtain mechanically. Note,
however, that this is not easily made foolproof. For we would probably
only examine the static program, and during the course of running it,
instructions might be built up which could cause overflow.

Again, a very naive check (and, one would hope, a last resort) would
be to work through the program, instruction by instruction, examining
the contents of critical registers. Such information can be obtained and
printed by the machine for contemplation by the programmer. Some
machines have equipment which does this monitoring by use of a simple
switch. Somewhat more sophisticated would be the monitoring, not of
all instructions, but only of specified ones (often those causing transfers
of control), which could be indicated by break points.

The path of a computation can be followed by printing, for instance,
symbols indicating the instructions used, with a new line at each transfer
of control. A useful exercise would be to carry this out for a simple
subroutine, like that for the square root.

It is also possible to design a subroutine which compares the program 
as it was when inserted with what it was when trouble developed and 
lists any changes. 

(c) Various phases of the work of computer engineers can be facilitated
by the use of a computer. For instance, a series of diagnostic subroutines
can be used to pinpoint trouble when it develops and also to aid in
preventive maintenance. The memory-summation process discussed
earlier is a simple example.

It is possible to use the machine to find out where time is spent in run- 
ning problems. The number of times each type of instruction is carried 
out during the running of a problem can be counted. This information 
can indicate areas where effort could be directed to improve the design 
of the equipment (see Herbst, Metropolis, and Wells [36]). 

Finally, machines can be used to prepare the wiring diagrams for their 
successors. 


4.31 Current Developments 


At this stage the reader should have become aware of the extraordinary 
obedience, reliability, and speed of current computers. He will prob- 
ably have experienced, in this context, some human weaknesses. He 
will have rightly concluded that efficiency will be increased by the use of 
the computer to assist in programming. 

The cost of the selection of an algorithm for the solution of a general
problem (for instance, the determination of all the eigenvalues of a
symmetric matrix), the analysis of its realization on an actual computer, and
the programming and the comparison of the theoretical error estimates
with the observed ones in a series of representative cases is enormous.
When such a program is needed on a different computer, it may be
worthwhile to simulate the first one on the second, rather than to recode the
algorithm. This sort of operation is often used by organizations when
they get new machines and are not able to decide whether recoding,
taking advantage of the facilities of the new equipment and the
accumulated experience on the older, is desirable.

Considerations of this kind lead to the concept of a universal language
for the algorithms; each computer would have a translator program
which produces from a program written in the algorithmic language the
program appropriate for its own use. One of the more elaborate
international experiments in this direction is the use of the Algol language
(see, e.g., Backus et al. [46]).

It is not appropriate for us to elaborate on these ideas here. For an 
account of the ideas of the Russian schools, we refer to Liapunov [44]; 
the British point of view is put forward by Gill [45] and an American 
by Carr [1]. 

Our present position on the use of advanced programming techniques
is that, although they are essential for the expert, the use of “simplified”
pseudo codes by the novice is dangerous, for he will produce too many
solutions, by improper methods, to incorrect problems. The additional
time spent by the novice in handling his problems by traditional methods 
will force him to think through his problems again and improve their 
formulation and analysis. 


5.1 Introduction


The numerical analyst can rightfully claim credit for motivating the 
development of the high-speed computer. With the development of 
the computer, however, a very conscious development of design logic 
occurred which was largely irrelevant to the analyst’s interest. Then 
another type of logic, computability logic (which had been developed 
long before the computer), suddenly found relevance “in principle,” 
without helping the analyst get specific answers! 

Perhaps computer logic will not long remain remote from practical
numerical analysis. Right now the newer machines seem to be beyond
the speed and storage requirements of the problems that were their
motivation, and the newer machines still can be made to seem inadequate
only when confronted with problems of a new type (in programming,
pattern recognition, etc.).

The purpose of this chapter is not futuristic. It is to describe the logic 
of computer usage briefly and informally, requiring some basic familiar- 
ity with programming but no familiarity of any sort with formal mathe- 
matical logic. The state of the literature [1] is too dynamic to make a 
comprehensive bibliography feasible, and the references cited are mostly 
cursory in nature. 


5.2 Computer Requirements 


The electronic computer was designed to imitate both the straight 
numerical work and the routinized thinking of the numerical analyst. 

The straight numerical work consists of an ordered list of instructions
defining new quantities on the basis of calculations of the type

$$A \odot B = C. \qquad (5.1)$$

Here a variety of arithmetic calculations are called forth, such as addition,
subtraction, multiplication, and division, as well as extended operations
such as $A - |B| = C$ and $-AB = C$. In fact, operations with only one
variable, such as $-|A| = C$, are still written in the form (5.1) (i.e., $B$ is
ignored). The letters, of course, stand for numerical values (and not
formal polynomials, for instance). Still, calculations of type (5.1) do not
make manifest certain widely accepted techniques, such as comparisons,
data layout, and the merging of different subroutines.

We might start by considering comparisons, as required by that well-known
flow chart for solving $A = f(A)$ to accuracy $\epsilon$ in Fig. 5.1. Certainly
we do not know beforehand how many iterations are needed, nor is it
practical or even correct (considering round-off) always to use the same
number of iterations. Hence the loop in the flow chart spares us the
writing of instructions in the running program. It also spares us the
writing of data-iteration symbols of the type $C = f(B)$, $D = f(C)$, etc.
(Note, even here, the parallelism involved in saving writing of data and
instructions.) For that reason we generalize the “=” to “→,” meaning
that, if $C$ has already been computed, the new value replaces the old. In
effect then, $A$, $B$, $D$, $Q$, ... become not only names but locations into
which the variables are to be written. In corresponding fashion, the
instructions are denoted by symbols which refer to the location where the
instructions are written out.

[Fig. 5.1 Flow chart for iterated solution to $A = f(A)$. Compute: $f(A) \to B$.
Compare: $|D| - \epsilon \to Q$; is $Q < 0$? Instruction J: $B \to A$. Instruction K: etc.]

Our general instruction now is described by saying that the value of
the instruction at location $I$ is given by

$$I = \left[ A \odot B \to C \ \text{ and } \ \begin{cases} J \text{ follows if } C \ge 0 \\ K \text{ follows if } C < 0 \end{cases} \right]. \qquad (5.2)$$

There are five addresses (locations $A$, $B$, $C$, $J$, $K$) and one operation $\odot$.
Certainly many of the addresses can be coincident or superfluous.
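
To make the five-address instruction concrete, we sketch a small interpreter for it. Everything in the sketch is an assumption made for illustration: the operation names, the use of a Python dictionary for the storage, the convention that a successor of `None` halts the machine, and the choice $f(A) = (A + N/A)/2$, whose fixed point is $\sqrt{N}$. Each instruction names an operation, the three data locations, and the two possible successors $J$ (taken when the result is nonnegative) and $K$ (taken when it is negative).

```python
def run(program, memory, start, max_steps=10_000):
    """Execute five-address instructions (op, A, B, C, J, K):
    memory[C] = memory[A] op memory[B]; J follows if the result is >= 0,
    K follows otherwise. A successor of None halts the machine."""
    ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: a / b,
        "abs_sub": lambda a, b: abs(a) - b,   # extended operation |A| - B
        "copy": lambda a, b: a,               # one-variable operation, B ignored
    }
    loc = start
    for _ in range(max_steps):
        if loc is None:
            return memory
        op, A, B, C, J, K = program[loc]
        memory[C] = ops[op](memory[A], memory[B])
        loc = J if memory[C] >= 0 else K
    raise RuntimeError("step limit exceeded")

# Iterate A <- f(A) until |B - A| < eps, in the manner of a flow-chart loop.
program = {
    "I0": ("div", "N", "A", "B", "I1", "I1"),
    "I1": ("add", "A", "B", "B", "I2", "I2"),
    "I2": ("mul", "B", "half", "B", "I3", "I3"),     # B = f(A)
    "I3": ("sub", "B", "A", "D", "I4", "I4"),        # D = B - A
    "I4": ("abs_sub", "D", "eps", "Q", "I5", "I6"),  # Q = |D| - eps
    "I5": ("copy", "B", "B", "A", "I0", "I0"),       # Q >= 0: B -> A, repeat
    "I6": ("copy", "B", "B", "A", None, None),       # Q < 0: finished
}
memory = {"N": 2.0, "A": 1.0, "B": 0.0, "D": 0.0, "Q": 0.0,
          "half": 0.5, "eps": 1e-12}
result = run(program, memory, "I0")["A"]

if __name__ == "__main__":
    print(result)
```

Only the comparison at `I4` uses two different successors; in every other instruction $J$ and $K$ coincide, which illustrates the remark that many of the addresses can be coincident.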

The idea of data layout is probably best exemplified by matrix methods.
The space-saving feature does not apply to data but to instructions, since
only one instruction need be given to apply to a whole column (e.g.,
“Subtract 5 times column one from column two”). The column operation
can be performed by using the same instruction over again by changing
the data locations in adding a constant. Hence, defining $A(I)$, $B(I)$,
$C(I)$ to be the data locations used within instruction $I$ (not the data values)
and letting $A'$, $B'$, $C'$ be locations at which new locations (not data) are
to be found, we must be able to extract into the instructions at $I$ the value
of the location found at location $A'$, by means of the statement

$$E_A = [A' \to A(I) \ \text{ and } \ J \text{ follows}], \qquad (5.3)$$

and likewise for $E_B$, $E_C$. Parenthetically, such an instruction has meaning
only if the appropriate arithmetic can be performed on the data locations
$A$, $B$, $C$ as well as on the data values.

One would expect an instruction-location analogue to the data-location
operations (5.3). One would expect that, defining $J(I)$, $K(I)$
as the $J$ and $K$ locations in instruction (5.2), one would have the
extraction instruction for inserting the address at $J'$ or $K'$ into an
arbitrary instruction at $I$, namely,

$$F_J = [J' \to J(I) \ \text{ and } \ L \text{ follows}], \qquad (5.4)$$

and likewise for $F_K$. This type of instruction actually comes about when
we are digressing from a main calculation to an auxiliary calculation or
subroutine. For instance, we may wish to calculate $C$ and then proceed
to the instruction at $J'$ if $C \ge 0$ or $K'$ if $C < 0$. If $I$ in (5.1) is the last (or
“exit”) calculation, yielding $C$, we perform extractions of the type $F_J$, $F_K$
[see (5.4)] into instruction $I$ before entering the subroutine producing $C$.

Thus we see that operations (5.2) to (5.4), made inevitable by the needs
of numerical analysis, are a kind of irreducible minimum [2] for a
numerical analyst. We might well ask whether the minimum can become
greater with the advent of more skillful numerical analysts. Although
the answer seems to be negative, such an eventuality can be considered
only by reducing the machine to several equivalent forms.


5.3 The Stored-program Computer 


A computer can meet the requirements of the previous section simply
by treating instructions in the same fashion as data [3]. Then the
arithmetic operations, augmented by extractions, are applied to data and to
instructions. Now instructions and data are composed in digital fashion
(with the $A$, $B$, $C$, $J$, $K$ locations expressed and distinguished digitally,
together with a digital representation of the operation $\odot$). What
characterizes an instruction is the fact that it becomes executed. The
instructions that become executed, however, need not all appear in the
program; some instructions are composed when needed and then destroyed.

In effect, the five-address system (5.2) is not necessary. Many high-speed
machines are one-address; that is, only one address appears in the
instructions. This system is set up by first sequencing a natural order of
instructions so that, say, $I + 1$ is the next location after $I$ and then by
singling out a special storage location $R$ to serve as a so-called
“accumulator.” Then the instruction $I$ reduces to the following set of
instructions in natural sequence:

$I$      (load)       $A \to R$.
$I + 1$  (calculate)  $R \odot B \to R$.
$I + 2$  (store)      $R \to C$.                              (5.5)
$I + 3$  (branch)     Follow by $K$ if $R < 0$.
$I + 4$  (transfer)   Follow by $J$ unconditionally.

Note that in each case there is only one address in reference: $A$, $B$, $C$, $K$,
or $J$.
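
A one-address machine of this kind is easily imitated. The sketch below is an illustration only; the operation names, the `halt` instruction, and the sample program (which accumulates $5 + 4 + 3 + 2 + 1$) are our own assumptions. Instructions are executed in natural sequence unless a branch or transfer intervenes, and every instruction refers to at most one address.

```python
def run_one_address(program, memory, max_steps=10_000):
    """Interpret one-address instructions (op, address) with a single
    accumulator R, taking instructions in natural order."""
    R = 0
    counter = 0
    for _ in range(max_steps):
        op, addr = program[counter]
        counter += 1
        if op == "load":
            R = memory[addr]
        elif op == "add":
            R = R + memory[addr]
        elif op == "sub":
            R = R - memory[addr]
        elif op == "store":
            memory[addr] = R
        elif op == "branch":            # follow by addr if R < 0
            if R < 0:
                counter = addr
        elif op == "transfer":          # follow by addr unconditionally
            counter = addr
        elif op == "halt":
            return memory
        else:
            raise ValueError(f"unknown operation {op!r}")
    raise RuntimeError("step limit exceeded")

program = [
    ("load", "total"), ("add", "n"), ("store", "total"),   # total + n -> total
    ("load", "n"), ("sub", "one"), ("store", "n"),         # n - 1 -> n
    ("sub", "one"),                                        # R = n - 1
    ("branch", 9),                                         # leave when n is 0
    ("transfer", 0),                                       # otherwise repeat
    ("halt", None),
]
memory = run_one_address(program, {"total": 0, "n": 5, "one": 1})

if __name__ == "__main__":
    print(memory["total"])    # 15
```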

Obviously a great variety of machines are feasible. Only one other
point is worth mentioning here. An instruction is usually fetched to a
separate “instruction register” for execution (and, as it is performed, the
values of the data are loaded and stored; then the value of the succeeding
instruction, at $J$ or $K$, replaces $I$ in the instruction register).

This setup has the advantage that the particular extraction operations
(5.3) and (5.4) can be separated from the rest of the arithmetic (5.2).
For instance, the machine can have three “indexing registers” at fixed
locations $A_0$, $B_0$, $C_0$ such that in the executed form (rather than in the
stored form) of $I$ the effective $A$ address is incremented by the contents of
location $A_0$, etc., thus taking care of (5.3), whereas the machine can have
in its so-called “instruction counter” $I_0$ the successor of the current
instruction location $I$. Hence, in using a subroutine, the machine need
only file the instruction counter in the $J$ or $K$ portion of the exit
instruction $I$. Fortunately, science is so much the richer for the fact that
indiscriminate mixing of instructions and data occurred before machines
were developed for avoiding the mixing.

We have made no mention of how the arithmetic is to be performed,
or essentially how many $\odot$ operations are to be permitted, and how much
of the extraction, indexing, and filing facility is to be permitted. Clearly
these operations are highly interdependent in the sense of Boolean algebra,
which we give a wide berth in this discussion. In fact, without
mentioning it, we showed that (5.2) to (5.4) constitute a machine equivalent
in some way to (5.5) (provided the proper digital structure and extractions
are assumed). Certainly a standard machine would be helpful, as
well as an equivalence theory.


5.4 Turing Machines 


Turing, in 1937 (before the age of advanced electronics), conceptualized
a machine [4] to describe the human-decision problem. The
outcome of his work was that his machine could not make certain
methodological decisions which human beings generally regard themselves
as being capable of making. We come to this phase of machine theory
later, but for the present we consider just Turing's machine.

The Turing machine has only a finite number of states (or instructions)
$I_1, \ldots, I_N$, but the machine has a semi-infinite tape. The tape is divided
into squares which are either marked (1) or blank (0). The machine
has a scanning head which can read and write only one square at a time.
The basic operation is as follows. The machine starts in state $I$. Then
the scanning head reads a square. Next the machine has two alternatives,
depending on whether the head has read a mark or a blank. Once
the alternative is decided, the scanning head follows a predetermined
course of action, marking or erasing the scanned square. The machine
finally goes into another predetermined state, while the scanning head
makes a predetermined motion on the tape, right or left, by one square
(or with no motion at all). The finiteness of the human physical structure
is presumably reflected in the finiteness of the number of states, and
the infinite variety of physical data is represented by the semi-infinite
tape. In practical terms, Turing considers the initial and final data to
be given on, say, odd-numbered squares, and the calculation is known
to be at an end when the reading head reaches the finite end of the tape.

There are unfortunately very few problems which are conveniently
programmed on the Turing machine just described. We mention one
simply to show how the Turing machine, like the stored-program computer,
uses accessories as flow charts [5].

Consider the following problem concerning a tape marked as shown:

(End) 00100101000110010....

We wish to have a general program in which the reading head, starting
at the leftmost symbol (presumed to be zero), will progress and erase all
isolated 1s, up to the first nonisolated 1s and then return. Thus the final
tape will be, for example,

(End) 00000000000110010...

and the reading head will return to the original zero.
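
The program for this problem can be tried out by simulating the machine. The sketch below uses a table of our own devising: the state names, and the table itself, are assumptions of the sketch and not a transcription of a published flow chart. The machine marks the first square, erases each 1 it meets, and inspects the next square to see whether the erasure was justified; if not, it backs up, restores the erased 1, returns to the mark, and erases it.

```python
RIGHT, LEFT, STAY = 1, -1, 0

# (state, symbol read) -> (symbol written, head motion, next state)
TABLE = {
    ("start", 0): (1, RIGHT, "hunt"),      # mark the first square for return
    ("hunt", 0): (0, RIGHT, "hunt"),       # keep looking for the next 1
    ("hunt", 1): (0, RIGHT, "check"),      # erase it without asking
    ("check", 0): (0, RIGHT, "hunt"),      # it was isolated: continue hunting
    ("check", 1): (1, LEFT, "restore"),    # successive 1s: back up
    ("restore", 0): (1, LEFT, "back"),     # restore the erased 1
    ("back", 0): (0, LEFT, "back"),        # hunt backwards for the mark
    ("back", 1): (0, STAY, "halt"),        # erase the mark and stop
}

def run_turing(tape, table=TABLE, state="start", max_steps=100_000):
    """Run the machine from the leftmost square; return (tape, head)."""
    tape = list(tape)
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            return tape, head
        if head == len(tape):
            tape.append(0)                 # the tape is unbounded to the right
        symbol_out, motion, state = table[(state, tape[head])]
        tape[head] = symbol_out
        head += motion
    raise RuntimeError("step limit exceeded")

initial = [int(ch) for ch in "00100101000110010"]
final, head = run_turing(initial)

if __name__ == "__main__":
    print("".join(str(s) for s in final), head)
```

There is deliberately no entry for reading a 1 in the starting state, since the leftmost symbol is presumed to be zero. The simulation itself uses only a table lookup and two indicators, the current state and the head position.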
The program is shown in Fig. 5.2. Actually, programs (or flow charts)
are called “special-purpose” machines, whereas the Turing machine
(which simulates any flow chart step by step) is called a “general-purpose”
or “universal” machine. The idea of the equivalence of two
machines implies the ability of each machine to simulate the other. A
rigorous definition of equivalence will not be attempted, since it would
involve a system of recognition of data and states between two machines
and is possible only within the framework of formal logic [6].

[Fig. 5.2 Illustrative special Turing machine. Explanation of the states, in
order: Mark the first square for later return (reading a 1 there is
impossible). Hunting loop: find the next 1 and erase it without asking if
erasure was justified. If erasure was justified (isolated 1), continue
hunting; otherwise (successive 1), back up, restore the erased 1, and enter
the final hunting loop: hunt backwards for the initial 1. Erase this 1 when
it is found.]

What we shall show, intuitively, however, is that the Turing machine
is equivalent to the more sophisticated stored-program computer. Not
the least of the difficulties is the fact that the former has only a finite
number of states, whereas the latter has an infinite number of possible
instructions, owing to the infinitude of data locations. As a consequence,
the word length for instructions will be indefinite (as well as for data if no
round-off is desired).


5.5 Equivalence Theory


We first of all note that the behavior of any Turing machine can easily
be simulated by a stored-program computer (of sufficient storage) with a
finite program. The semi-infinite tape can correspond to a semi-infinite
portion of storage, and the program can be put into a separate portion
containing two special registers $S$ and $L$, indicating, respectively, the
location of the state of the Turing machine and the location of the
simulated position of the scanning head. The finite matrix of $N$ states with
alternative actions can be put into a finite part of storage. It is now an
elementary programming problem to see how the stored-program computer
looks up the contents at $L$, looks up the alternative in $S$, executes
the action required, and then changes the indicators $S$ and $L$. The
branching property (5.2) and the data location (5.3) are the only properties
required; that is, the property of altering an instruction address (5.4)
need not be brought into play. We can seem to get away with less of a
machine, since we can find the right address by successive choices, without
extracting any instruction locations into an instruction directly [7].

The converse equivalence is more difficult, namely, the designing of a
Turing machine with a sufficiently large number of states to simulate a
stored-program computer. The literature fails to show any estimate of
how many states are needed, nor shall we try to find such an estimate
here. We merely outline a procedure for seeing the equivalence
intuitively in several stages:

1. The Turing machine is improved to the extent that the scanning
head reads and writes any finite number of symbols. Thus a symbol on
a square can be marked (e.g., by a star) for later reference and will be
readable. Thus the $A$, $B$, $C$, $J$, $K$, $\odot$ components of (5.1) can be
separated and distinguished.

2. The Turing machine of item 1 is further improved so that it reads
and writes a finite number of tapes at once. Specifically, we have six
(register) tapes to store the $A$, $B$, $C$, $I$, $J$, $K$ locations and contents, one
tape to serve as the general storage, containing locations and contents of
words suitably marked.

3. By a comparison operation, the machine can match locations and
fetch any (data or instruction) item by location and then store any item
at a new blank spot on the tape. (Since the word length is generally
unlimited, the fetched item may have to be destroyed in storage if a new
item goes with the same location but is too long to insert.)

4. The arithmetic is performed serially; so the problem is independent 
of the size of the numbers. This process is described in most books on 
computer design [8] (and may require another utility tape or two in 
item 2). 
To give an example of the austerity of such proofs as are called forth
here, we might just consider item 1. The symbols are all put in binary
representation, so that the basic induction operation is a proof that a
(modified) Turing machine that can read and write $M$ symbols $a, b, \ldots$
is equivalent to one that can read and write $M^2$ symbols $aa, ab, ba, bb, \ldots$.
This is shown in Fig. 5.3 (where the states, as usual, are encircled, and
superfluous steps are omitted).

[Fig. 5.3 Simulation of $M^2$ symbol machine by an $M$ symbol machine.
Explanation of the successive levels: ($N$ states) Machine knows $a$, does not
know $b$ (cannot decide on action). ($NM$ states) Machine knows $a$, $b$, can
decide on action. ($NM^2$ states) Write $a'$, move by $\epsilon$; moves to new
location. ($NM^2$ states) Move completed.]

We start with a machine with $N$ states that can read $M^2$ symbols
(symbolically) $ab$ in state $I$ and can write $M^2$ symbols (symbolically) $a'b'$,
moving $2\epsilon$ units ($\epsilon = 0, \pm 1$) and going into state $J$. When these steps
are accomplished, the number of intermediate states augments the total
to $N(2M^2 + M + 1)$ states for a machine that reads only $M$ symbols and
does the “same” operations.

The reader may wish to construct the flow chart for item 2, indicating
how the following double tape,

a b c¹ d ...
m n o p² ...,

with ¹ and ² denoting the scanning heads, can be replaced by the single
tape *ambnc¹odp² ..., where the scanning head is on some reserved symbol
* between moves.
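
The bookkeeping of the merged tape can be sketched as follows. The representation is an assumption made for illustration: each square of the single tape is a pair (symbol, head mark), the head mark being 1 or 2 on the square under the corresponding scanning head, and the function names are our own. The sketch shows only how the two tapes and both head positions are carried on one tape and recovered from it, not the full flow chart.

```python
def interleave(tape1, tape2, head1, head2):
    """Merge two tapes square by square into one list of (symbol, head_mark)
    pairs, preceded by the reserved symbol '*'."""
    if len(tape1) != len(tape2):
        raise ValueError("tapes must have the same length")
    merged = [("*", None)]
    for k, (s1, s2) in enumerate(zip(tape1, tape2)):
        merged.append((s1, 1 if k == head1 else None))
        merged.append((s2, 2 if k == head2 else None))
    return merged

def split(merged):
    """Recover the two tapes and the two head positions."""
    body = merged[1:]
    first, second = body[0::2], body[1::2]
    tape1 = [s for s, _ in first]
    tape2 = [s for s, _ in second]
    head1 = next(k for k, (_, mark) in enumerate(first) if mark == 1)
    head2 = next(k for k, (_, mark) in enumerate(second) if mark == 2)
    return tape1, tape2, head1, head2

if __name__ == "__main__":
    merged = interleave("abcd", "mnop", head1=2, head2=3)
    print("".join(s for s, _ in merged))    # *ambncodp
    print(split(merged))
```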


5.6 Duality of Instructions and Data 


One principle that can be abstracted from the preceding description is 
some duality of instructions and data. Certain evidences may be cited. 
First of all, the extraction operations in Sec. 5.2 arise naturally for each. 
Therefore, instructions and data are both treated in the same arithmetic 
fashion by the stored-program computer. In the Turing machine, the 
number of symbols goes down at the same time that the number of states 
goes up. In fact, Shannon [9] describes a two-state Turing machine 
with very many symbols and asks for the machine which has a minimum 
value for the product of the numbers of states and symbols. 

One way of finding new areas in machine usage is to ask for instances 
in which instructions have not yet received the same treatment as data. 


5.7. Automatic Programming 


The idea of handling instructions like data constitutes automatic
programming. To a certain extent, automatic programming is built into the
circuitry of the machine, but usually the term denotes activity preparatory
to conventional machine computation. As techniques become more
advanced, such distinctions should vanish (through built-in features).

The earliest form of automatic programming seems to have been the 
floating-address system [10] due to Wilkes (1951). The purpose of this 
system is to permit subroutines to be written with symbolic addresses so 
that the symbols may be translated by machine into permanent addresses 
(avoiding those portions of storage that might have been preempted by 
an earlier subroutine). 

A later development was the interpretative system, or “pseudocode.” In
this system one programs for a hypothetical machine, and the actual
machine translates the program step by step and executes each program
step between translations. Thus the one-address analogue (5.5) of the
stored-program computer (of Sec. 5.2) is an interpretative system (although
usually the one-address code possesses a multiaddress interpretative
system). The interpretative system has the disadvantage, in regard
to time, that the translation process is used every time the instruction is
executed (very much oftener than the number of times the instruction
was written in the original program). The floating-address translation
occurs only once, however, for each time the instruction is written.
Hence the pseudo code is an equivalent machine.

Here it might be mentioned that one form of interpretative system
would be of inestimable value to numerical analysts for the automatic
monitoring of round-off errors. Symbolically each $A, B, \ldots$ has an
associated error $\Delta A, \Delta B, \ldots$. Therefore, of considerably more value than
$A \circ B \to C$ would be the pair of calculations

    $A \circ B \to C,$    (5.6)
    $\max |A \circ B - A' \circ B'| \to \Delta C,$    (5.7)

where $A'$, $B'$ are subject to $|A - A'| \le \Delta A$, $|B - B'| \le \Delta B$. This refine-
ment would double the data-storage requirement to include the $\Delta A$,
$\Delta B, \ldots$ and triple the running time (at least). Unfortunately the
program would have to interpret A © B by a subroutine each time it 
occurred. (The same could be done for truncation error if the general 
term of a series was given.) If, after suitable trials, the numerical analyst
decided to ignore round-off analysis or change to a simple scaling or 
floating-point method, he should be able to make the new interpretation 
of his code in whole or part, entirely by machine. 
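
The pair of calculations (5.6) and (5.7) can be sketched in a few lines. The following is only an illustration under stated assumptions: the class name `Tracked` is hypothetical, only addition and multiplication are covered, and the rounding committed by the operation itself is ignored, so the bound reflects the propagation of $\Delta A$ and $\Delta B$ alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value A together with a bound Delta A on its absolute error."""
    value: float
    err: float

    def __add__(self, other):
        # |(A + B) - (A' + B')| <= Delta A + Delta B
        return Tracked(self.value + other.value, self.err + other.err)

    def __mul__(self, other):
        # max |AB - A'B'| = |A| Delta B + |B| Delta A + Delta A Delta B
        bound = (abs(self.value) * other.err
                 + abs(other.value) * self.err
                 + self.err * other.err)
        return Tracked(self.value * other.value, bound)


a = Tracked(2.0, 0.1)
b = Tracked(3.0, 0.2)
c = a * b          # value A*B with its propagated bound Delta C
s = a + b
print(c, s)
```

Each operand carries two numbers, which is the doubling of storage mentioned above, and every operation costs several arithmetic steps instead of one.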

An interpretative system that is translated before use is called a compiler 
[11]. It is usually much more elaborate than a mere floating-address 
system, and it requires an amazing degree of complexity in the translation 
of even simple phrases of pidgin English. The details of such systems go
far beyond the scope of this chapter, and probably the manuals of lead- 
ing computer manufacturers are the best source of information. Two 
points, however, are of interest to users of computers and to numerical 
analysts in particular. 

First of all, experiments in machine learning actually point in the direc- 
tion of helping the customer get answers. Machine learning (or con- 
ditioning) is the process whereby the machine performs a subroutine in 
a manner dependent on the previous “experience” of the machine, with-
out the user’s being in attentive control. This might require a large file
of previous (similar) problems to be called upon by the machine without
the specific knowledge of the user. It might, however, just involve fairly
current calculations. This learning process would be a valuable pro-
cedure when it takes too long to decide each case ahead of time, for the
machine could then sample, say, every dozen usages, or else it could 
apply to the next case information available only after the subroutine is 
completed each time. A more daring procedure [12] is to extend the 
learning process from data to instructions and have the machine write a 
sequence of instructions by trial and error. A prohibitive factor, in-
cidentally, could be the machine time involved in translating English
syntax or using trial and error. It seems that automatic programming,
rather than larger reactors, is the more creative justification for faster 
machines. 

A final point is a rather gloomy one. A human being can decide
better than a machine whether a statement is translatable into machine
language. This is a form of a mathematical theorem (not a mere human-
istic sentiment), and it is discussed further below.


5.8 Undecidability: Gödel’s and Turing’s Problems


A good starting point for seeing limitations of a computing machine is
some kind of paradox [13]. Then the process of putting the paradox
into machine language must have a weak spot, which must turn out to
be machine decidability. For instance, consider the famous paradox
concerning phrases of English that represent integers. We consider the
connotations of all phrases of 11 or fewer words. Most of them represent
gibberish, or at least they represent no integer. Those that represent an
integer can be enumerated lexicographically, and the largest integer, say
$M$, can be selected. We then consider the phrase, “One plus the largest
integer representable by eleven or fewer words.” If we count the num-
ber of words, we find that in 11 words we have represented $M + 1$, pro-
ducing a contradiction to the definition of $M$. From the machine point
of view, we need not worry about the paradox, since it only proves that 
no program can enable a computing machine to examine a phrase of 
English and decide (with the help of however many encoded rules of 
syntax, grammar, etc.) whether or not the phrase connotes an integer. 

The limitations of a machine, put succinctly, are that no program exists 
which enables a machine to examine any program and to decide in a finite number 
of steps on a property that effects visualization of the infinitely many possible steps 
of the program. We display this limitation by two different arguments, 
which, however informal, are more palpable than representation by
“phrases of English.”

To do this, we first imagine all programs written out as instructions 
and data (with locations), console settings, etc. Thus any program is a 
finite sequence of ordered integers only (considering alphabetics to be
treated numerically), namely, $a_1, a_2, \ldots, a_m$. The number
$k = 2^{a_1} 3^{a_2} \cdots p_m^{a_m}$ (where $p_m$ is the $m$th prime) corresponds uniquely to the pro-
gram. It is called the Gödel number of the program.
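
As a small illustration of this numbering, the sketch below encodes a finite sequence of integers as a product of prime powers and decodes it again. The function names are hypothetical, and we assume the entries are nonnegative with a nonzero last entry, since trailing zeros leave no trace in the product.

```python
def first_primes(m):
    """Return the first m primes by trial division (adequate for small m)."""
    primes = []
    candidate = 2
    while len(primes) < m:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def godel_number(seq):
    """k = 2**a_1 * 3**a_2 * ... * p_m**a_m for seq = [a_1, ..., a_m]."""
    k = 1
    for p, a in zip(first_primes(len(seq)), seq):
        k *= p ** a
    return k


def decode(k):
    """Recover [a_1, ..., a_m] from k by dividing out successive primes."""
    seq = []
    m = 1
    while k > 1:
        p = first_primes(m)[-1]
        a = 0
        while k % p == 0:
            k //= p
            a += 1
        seq.append(a)
        m += 1
    return seq


k = godel_number([3, 1, 2])   # 2**3 * 3**1 * 5**2
print(k, decode(k))
```

Uniqueness of the correspondence is just uniqueness of prime factorization.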

Adapting Gödel’s procedure [14] to machines, we define a program as
deciding on a property if and only if it is so constituted that the program
starts with a variable number $n$ in a special register and stops with $+1$ or
$-1$ in that same register, depending on whether or not $n$ has the property
of the program. Clearly very few programs will do this, but we assert
that there is no machine program for deciding whether an arbitrary num-
ber is the Gödel number of a property-deciding program in particular.

For if so, the machine program could be modified slightly to enumerate
all property-deciding programs, and the machine could consequently
enumerate all integers (in numerical order) that represent property-
deciding programs. Hence the machine could, by a finite program,
decide the truth or falsity ($\pm 1$) for any $n$ of the property that “$n$ is not
satisfied by the $n$th property-deciding program.” But this property
would have a Gödel number in this numerical sequence of programs,
say the $q$th, and the question of whether $q$ satisfies the $q$th property-deciding
program leads to a paradox. Hence no finite program can identify a
property-deciding program.

Turing’s form of the paradox [15] is quite similar, except that it can 
be adapted to fit a machine more closely. We consider some attribute 
relative to the indefinite running of a program—for instance, the question 
of whether error stops ever occur in the running of the program. There 
does not exist a program which enables the machine to examine an arbitrary program
to decide in a finite number of steps whether error stops occur. 

To prove this statement, let us suppose the contrary. Then a program
$\mathcal{A}$ exists, such that an arbitrary program $\mathcal{X}$ could be examined in finite
time to decide, by just running the combination $(\mathcal{A},\mathcal{X})$ ($\mathcal{A}$ as a pro-
gram followed by $\mathcal{X}$ as data), whether or not $\mathcal{X}$ produces error stops.
We can then modify the procedure by saying that, since we assume we
can test the running of $\mathcal{X}$, we can also test the running of program $\mathcal{X}$
followed by data $\mathcal{X}$, or the combination $(\mathcal{X},\mathcal{X})$. This combination
involves a procedure for reading a program into the machine, having the
machine acknowledge the end of the program, and then having the
machine read data to do what it will (which even may include the error-
stop rejection of data $\mathcal{X}$ following program $\mathcal{X}$). Thus, most often
$(\mathcal{X},\mathcal{X})$ will be meaningless; but regardless of this fact, on our assump-
tion that $\mathcal{A}$ exists, the machine could then tell whether $(\mathcal{X},\mathcal{X})$ will
produce an error stop. Let the machine announce the effect of $(\mathcal{X},\mathcal{X})$,
by the running of $(\mathcal{B},\mathcal{X})$, as follows: if $(\mathcal{X},\mathcal{X})$ becomes known to pro-
duce an error stop, let $(\mathcal{B},\mathcal{X})$ run to a normal stop; if $(\mathcal{X},\mathcal{X})$ produces
no error stop, let $(\mathcal{B},\mathcal{X})$ run to an (artificially produced) error stop.
Then the running of $(\mathcal{B},\mathcal{B})$ leads to a paradox (seemingly neither pro-
ducing nor failing to produce an error stop).
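
The construction can be imitated in a modern language. The sketch below is only an illustration under assumptions of our own: programs are modeled as Python callables, an error stop as a raised `RuntimeError`, and a claimed decider as a function `decider(program, data)` that always returns `True` (error stop predicted) or `False`. All names are hypothetical. Whatever such a decider answers about the contrary program run on itself, the actual run does the opposite.

```python
def make_contrary(decider):
    """Build the contrary program from a claimed error-stop decider."""
    def contrary(program):
        if decider(program, program):
            return "normal stop"           # error stop predicted: stop normally
        raise RuntimeError("error stop")   # none predicted: produce one
    return contrary


def produces_error_stop(program, data):
    """Observe the actual behavior by running program on data."""
    try:
        program(data)
    except RuntimeError:
        return True
    return False


def refuted(decider):
    """True if the decider mispredicts the contrary program run on itself."""
    b = make_contrary(decider)
    return decider(b, b) != produces_error_stop(b, b)


def always_yes(program, data):
    return True


def always_no(program, data):
    return False


print(refuted(always_yes), refuted(always_no))
```

The two trivial deciders are only stand-ins; the same contradiction arises for any decider that always returns an answer.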

Note that the programs that have been shown to be nonexistent are 
programs that do things which the average programmer can very often 
decide (in fact is expected to decide). The programmer still has no 
right to say he has a “method”; he is safer when he attributes his decision
to an inspired guess or to good fortune. True randomness, if it exists,
acquires an even greater role by defeating the undecidability argument
[16].


5.9 Concluding Remarks


We have gone through a rather rapid description of the logical basis 
for computer usage. Actually numerical analysts have not yet carried 
their programming anywhere near the limits of computer capacities.
They can do this only by further ventures into automation in directions
that may not be clear today. Yet even so, their achievements are 
circumscribed by the machine’s inability to predict the running of its 
program, and these restrictions can be lifted only by new types of 
randomization.


6.1 General Introduction


The study of matrix computations—in which we include the evaluation 
of determinants, the solution of simultaneous systems of linear equations 
and linear inequalities, the inversion of matrices, the determination of 
characteristic values and vectors of a matrix, and the solution of the 
general characteristic-value problem $Ax = \lambda Bx$—has been most intense
in the last two decades. We cannot attempt to cover this field, but we 
shall discuss some representative methods. 

For an account of some of these problems from the point of view of 
desk computation, we refer to Fox [1]; the books of Crandall [2] and 
Frazer, Duncan, and Collar [11] are valuable accounts of the use of these 
methods in actual practice, as are various accounts of “relaxation” 
methods (see Forsythe [6]). Expository accounts, with worked ex-
amples, are available, for instance, in Taussky and Todd [8], in [9], and 
at greater length in Faddeeva [10]. 

Among the more elaborate texts are those of Bodewig, Dwyer, Zur-
mühl, and Householder, for which detailed references are given in Sec.
2.43. See also Wilkinson [37]. 

Four volumes of the National Bureau of Standards Applied Mathe- 
matics Series [3, 4, 5, 13] are devoted to various aspects of matrix calcu- 
lations. A symposium on the subject was held at Wayne University in 
1957, but the papers are rather scattered (see [18] for abstracts). 

From the current point of view on high-speed automatic computers, 
there are four classical papers: Turing [14], von Neumann and Gold- 
stine [15], Givens [16], and Goldstine, Murray, and von Neumann [17]. 

Among those who have made notable contributions, in addition to
those cited above, are Aitken, Bauer, Hestenes, Rutishauser, Stiefel,
Varga, Wilkinson, Young, Forsythe, Householder, Lanczos, and Ost-
rowski.



THE SOLUTION OF LINEAR EQUATIONS 
AND THE INVERSION OF MATRICES 


6.2 Introduction 


It is convenient to distinguish between direct methods and indirect, or 
iterative methods, as exemplified by the (direct) Gaussian elimination 
method and the (indirect) Seidel method. It is dangerous to make 
definite statements about their relative merits, but it would appear that 
the Gaussian methods are preferable in the case of general matrices, 
whereas the Seidel method and its developments (see Chap. 11) are 
successful with “sparse” matrices, such as those which arise from the
numerical solution of differential equations. 

The classification of methods has been studied by Forsythe and House- 
holder in various papers. 


6.3 Indirect Methods 


The following method for computing the reciprocal of a number is in
common use. Let $a$ be a nonzero number. Let $x_0$ be arbitrary and
define the sequence

    $x_{r+1} = x_r(2 - ax_r), \quad r \ge 0.$    (6.1)

Then, in appropriate circumstances, $x_r \to 1/a$ as $r \to \infty$. To see what
these circumstances are, set $y_r = 1 - ax_r$. Then, upon multiplication by
$a$, (6.1) implies that

    $y_{r+1} = y_r^2,$    (6.2)

so that

    $y_r = y_0^{2^r}.$    (6.3)

Thus, if $x_0$ is chosen so that the modulus of $y_0 = 1 - ax_0$ is less than 1,
then $y_r \to 0$ as $r \to \infty$, and consequently $x_r \to 1/a$ as $r \to \infty$. Similarly,
if $x_r \to 1/a$ as $r \to \infty$, then $y_r \to 0$ as $r \to \infty$, and $y_0$ must be of modulus
less than 1. That is, the process defined by (6.1) produces the reciprocal of $a$ if
and only if the initial approximant $x_0$ is chosen so that $|1 - ax_0| < 1$. This sim-
ple remark finds application later in an analogous matrix problem.

A measure of the error engendered by (6.1) is $y_r$, and (6.2) shows that
this error becomes squared at each step—a computationally pleasant 
fact. 
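
The iteration (6.1) and the squaring of the error can be observed directly. The sketch below is a minimal illustration in ordinary floating-point arithmetic; the function name and the sample values $a = 7$, $x_0 = .1$ and $x_0 = .5$ are chosen for illustration.

```python
def reciprocal_iterates(a, x0, steps):
    """Return [x_0, x_1, ..., x_steps] for x_{r+1} = x_r (2 - a x_r)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x * (2 - a * x))
    return xs


xs = reciprocal_iterates(7, 0.1, 4)
errors = [1 - 7 * x for x in xs]      # y_r = 1 - a x_r, squared at each step
for r, (x, y) in enumerate(zip(xs, errors)):
    print(r, x, y)

bad = reciprocal_iterates(7, 0.5, 2)  # |1 - a x_0| = 2.5 > 1: the process fails
print(bad)
```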

Example. Let us take $a = 7$, $x_0 = .1$. Then $1 - ax_0 = .3$, so that the con-
ditions for the convergence of (6.1) are satisfied. We have $x_1 = .13$, $x_2 =
.1417$, $x_3 = .14284777$, $x_4 = .1428571422\ldots$, so that $x_4$ agrees with $1/7$ to 9
places of decimals.

If instead of $x_0 = .1$ we took $x_0 = .5$, say, then $1 - ax_0 = -2.5$, and process
(6.1) would fail. Thus $x_1 = -.75$, $x_2 = -5.4375$ and $x_r \to -\infty$ as $r \to \infty$.

If $a$ is a number, it is easy to decide when $a^r \to 0$ as $r \to \infty$. This hap-
pens if and only if $|a| < 1$. The corresponding question for matrices,
however, is not so trivial, and we develop the necessary and sufficient
condition that a square matrix $A$ with complex entries satisfy $\lim_{r \to \infty} A^r = 0$.


Let $E$ be the $n \times n$ matrix which has 1 in positions $(1,2), (2,3), \ldots,
(n-1,n)$ and 0 elsewhere. Then $E^k$, $0 \le k \le n-1$, is the matrix which
has 1 in positions $(1,k+1), (2,k+2), \ldots, (n-k,n)$ and 0 elsewhere.
Furthermore, $E^n = 0$. We put

    $B = \lambda I + \epsilon E.$

Then

    $B^r = (\lambda I + \epsilon E)^r = \sum_{k=0}^{r} \binom{r}{k} \lambda^{r-k} \epsilon^k E^k.$

Thus, if $r \ge n-1$,

    $B^r = \sum_{k=0}^{n-1} \binom{r}{k} \lambda^{r-k} \epsilon^k E^k.$    (6.4)

Since $\binom{r}{k} \lambda^{r-k} \to 0$ as $r \to \infty$ if and only if $|\lambda| < 1$, we can make the
following statement.

Lemma 6.1. $B^r \to 0$ as $r \to \infty$ if and only if $|\lambda| < 1$.

Suppose now that $A$ is an $n \times n$ matrix with complex entries. From
the Jordan decomposition theorem we know that there is a matrix $S$ and
a matrix $C$ such that $A = S^{-1}CS$ and $C$ is a direct sum of $B$ matrices:

    $C = B_1 \oplus B_2 \oplus \cdots \oplus B_t,$

where $B_k = \lambda_k I^{(k)} + \epsilon E^{(k)}$, $1 \le k \le t$. Here $I^{(k)}$ and $E^{(k)}$ denote $I$
and $E$ matrices, respectively, of orders $n_k$, with $\sum_{k=1}^{t} n_k = n$. Since $A^r = S^{-1}C^rS$,
we conclude that $A^r \to 0$ as $r \to \infty$ if and only if $C^r \to 0$ as $r \to \infty$.
But this happens if and only if $B_k^r \to 0$ as $r \to \infty$, $1 \le k \le t$. If we notice
now that the $\lambda_k$’s are just the eigenvalues of $C$ (and so of $A$, since $A$ and $C$
are similar), Lemma 6.1 implies our first important result.

Theorem 6.2. $A^r \to 0$ as $r \to \infty$ if and only if each eigenvalue of $A$ is
of modulus less than 1.

This observation underlies many iteration schemes in computation
with matrices. Here we adapt scheme (6.1), which produces the recip-
rocal of a number, to the computation of the inverse of a matrix. Sup-
pose that $A$ is a nonsingular matrix and $X_0$ an arbitrary matrix. We
define

    $X_{r+1} = X_r(2I - AX_r), \quad r \ge 0,$    (6.5)

and we put $Y_r = I - AX_r$. Then, as with (6.1), we have

    $Y_{r+1} = Y_r^2,$    (6.6)
    $Y_r = Y_0^{2^r}.$    (6.7)



Thus $X_r \to A^{-1}$ as $r \to \infty$ if and only if $Y_r \to 0$ as $r \to \infty$; and $Y_r \to 0$
as $r \to \infty$ if and only if the eigenvalues of $Y_0 = I - AX_0$ are all of modulus
less than 1, by Theorem 6.2. Thus we have the following result.

Theorem 6.3. A necessary and sufficient condition for the process (6.5) to
produce $A^{-1}$ is that $X_0$, the initial approximant, be chosen so that each eigenvalue of
$I - AX_0$ is of modulus less than 1.

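Example. We take

    $A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix},$

so that

    $A^{-1} = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix}.$

We take

    $X_0 = \begin{bmatrix} 1.9 & -.9 \\ -.9 & .9 \end{bmatrix}.$

Then

    $I - AX_0 = \begin{bmatrix} 0 & 0 \\ -.1 & .1 \end{bmatrix}$

has eigenvalues 0, .1 which are both of modulus less than 1. The conditions
for the convergence of (6.5) are satisfied, and we have

    $X_1 = \begin{bmatrix} 1.99 & -.99 \\ -.99 & .99 \end{bmatrix},$

    $X_2 = \begin{bmatrix} 1.9999 & -.9999 \\ -.9999 & .9999 \end{bmatrix},$

so that $X_2$ is in good agreement with $A^{-1}$.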
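
The computation in this example can be reproduced with a few lines of code. The sketch below is a minimal illustration using plain nested lists; the helper names `matmul`, `identity`, and `improve_inverse` are hypothetical.

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def improve_inverse(A, X, steps):
    """Apply X <- X (2I - A X), process (6.5), the given number of times."""
    n = len(A)
    I = identity(n)
    for _ in range(steps):
        AX = matmul(A, X)
        correction = [[2 * I[i][j] - AX[i][j] for j in range(n)]
                      for i in range(n)]
        X = matmul(X, correction)
    return X


A = [[1.0, 1.0], [1.0, 2.0]]
X0 = [[1.9, -0.9], [-0.9, 0.9]]
X1 = improve_inverse(A, X0, 1)
X2 = improve_inverse(A, X0, 2)
print(X1)
print(X2)
```

Each step costs two matrix multiplications, one for $AX_r$ and one for the product with $X_r$.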

The process suffers from certain defects, the most serious one being 
that in practice the choice of $X_0$ is difficult, if not impossible. If, how-
ever, an approximate inverse of A has been determined by some other 
method, then (6.5) finds its proper role as an improvement scheme, and this 
is how it is generally used. Another defect is that a large number of 
figures must be carried to obtain any improvement at all, principally 
because two matrix multiplications must be performed. 

As another example, consider the recurrence

    $X_{r+1} = B + (I - BA)X_r, \quad r \ge 0,$    (6.8)

with $X_0 = O$ and $A$ nonsingular. Setting $Y_r = I - AX_r$ as before, we
find from (6.8) that

    $Y_r = (I - AB)Y_{r-1},$    (6.9)
    $Y_r = (I - AB)^r.$    (6.10)

From (6.10) we can conclude that $X_r \to A^{-1}$ as $r \to \infty$ if and only if
each eigenvalue of $I - AB$ (or of $I - BA$) is of modulus less than 1. Con-
trary to the previous process, the error matrix is not squared at each step
but is multiplied by a constant matrix, as shown by (6.9). The conver-
gence therefore cannot be as rapid as that of (6.5). Scheme (6.8) has
the advantage, however, that each step requires just one matrix multi-
plication instead of two, and this is occasionally an important considera-
tion.


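Example. As before, we take

    $A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix},$

so that

    $A^{-1} = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix}.$

We take

    $B = \begin{bmatrix} 1.9 & -.9 \\ -.9 & .9 \end{bmatrix},$

so that

    $I - AB = \begin{bmatrix} 0 & 0 \\ -.1 & .1 \end{bmatrix}$

has eigenvalues 0, .1. Process (6.8) will therefore succeed, and we find

    $X_1 = \begin{bmatrix} 1.9 & -.9 \\ -.9 & .9 \end{bmatrix}, \quad
    X_2 = \begin{bmatrix} 1.99 & -.99 \\ -.99 & .99 \end{bmatrix}, \quad \ldots$

The improvement, roughly speaking, is one digit at a time.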


Choosing $B = I$ and setting $I - A = C$ in (6.8), we see that $X_r = I +
C + \cdots + C^{r-1}$. From the preceding discussion we obtain the following
useful corollary.

Corollary 6.4. The identity

    $(I - C)^{-1} = I + C + C^2 + \cdots$

is valid if and only if each eigenvalue of $C$ is of modulus less than 1.

The preceding discussion shows that it is of interest to possess some
simple criterion for deciding whether or not the eigenvalues of a matrix
are inside the unit circle. Let $\lambda$ be an eigenvalue of $A = (a_{ij})$ and $x =
(x_1, x_2, \ldots, x_n)'$ a corresponding eigenvector, so that $Ax = \lambda x$. Let $m$ be
a subscript for which $|x_m|$ is a maximum (and so certainly not zero). Then

    $\lambda - a_{mm} = \sum_{j \ne m} a_{mj} \frac{x_j}{x_m},$

    $|\lambda - a_{mm}| \le \sum_{j \ne m} |a_{mj}|.$

Thus $\lambda$ is contained in the circle in the complex plane with center $a_{mm}$
and radius $\sum_{j \ne m} |a_{mj}|$. Thus we have the following result.

Theorem 6.5. The eigenvalues of $A$ lie in the union of the circles

    $|z - a_{ii}| \le \sum_{j \ne i} |a_{ij}|, \quad 1 \le i \le n.$    (6.11)

This theorem is known as the Gersgorin circle theorem and (6.11) as
the Gersgorin circles.

For developments of this and related results, see Chap. 8.

Example. The matrix

    $\begin{bmatrix} 0 & 0 \\ -.1 & .1 \end{bmatrix}$

which occurred previously has eigenvalues 0, .1. Theorem 6.5 shows that
these eigenvalues lie in the union of the two circles $|z| \le 0$, $|z - .1| \le .1$ and
so certainly are inside the unit circle. It happens quite frequently that this
theorem obviates the need of computing the eigenvalues.
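
The circles (6.11) are cheap to compute. The sketch below is a minimal illustration with hypothetical function names; it lists the center and radius for each row and applies the resulting test for the unit circle, which is sufficient but not necessary.

```python
def gershgorin_discs(A):
    """Return (center, radius) of the circle (6.11) for each row of A."""
    discs = []
    for i, row in enumerate(A):
        radius = sum(abs(v) for j, v in enumerate(row) if j != i)
        discs.append((row[i], radius))
    return discs


def discs_inside_unit_circle(A):
    """Sufficient test: every circle lies strictly inside |z| < 1."""
    return all(abs(c) + r < 1 for c, r in gershgorin_discs(A))


Y0 = [[0.0, 0.0], [-0.1, 0.1]]
print(gershgorin_discs(Y0))
print(discs_inside_unit_circle(Y0))

# The test can fail although the eigenvalues (here both 0) are inside.
N = [[0.0, 2.0], [0.0, 0.0]]
print(discs_inside_unit_circle(N))
```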


Corollary 6.6. If the sums

    $\sum_{j=1}^{n} |a_{ij}|, \quad 1 \le i \le n,$

are all less than 1, or if

    $\sum_{i=1}^{n} |a_{ij}|, \quad 1 \le j \le n,$

are all less than 1, then all the eigenvalues of $A$ are inside the unit circle.

We want to associate with a matrix A a number which will measure its 
“distance” from the zero matrix or, more generally, will measure the 
distance between two matrices A, B. More precisely, we require a real- 
valued function N having the following properties: 


    $N(A) \ge 0$ and $N(A) = 0$ if and only if $A = 0$,    (6.12)
    $N(cA) = |c|\,N(A)$, $c$ a complex number,    (6.13)
    $N(A + B) \le N(A) + N(B),$    (6.14)
    $N(AB) \le N(A)N(B).$    (6.15)


There have been recent studies of norms by Faddeeva [10] and Ost- 
rowski [29]. 

Properties (6.12) and (6.14) are the usual properties of a distance func-
tion in a metric space. Property (6.13) is familiar also, but (6.15) is
peculiar to matrices. It is not evident that such functions exist, and we
define

    $F(A) = \Bigl( \sum_{i,j} |a_{ij}|^2 \Bigr)^{1/2}$ (the Frobenius norm),    (6.16)

    $M(A) = n \max_{i,j} |a_{ij}|.$    (6.17)

Properties (6.12) to (6.15) are tolerably obvious for $M$, whereas (6.12)
to (6.14) are also clear for $F$, if $A$ is regarded as a vector in euclidean
$n^2$-dimensional space and $F(A)$ as its length. As for (6.15), if $A$ and $B$
are conformable (that is, if it is possible to form $AB$), then

    $F(AB)^2 = \sum_{i,j} \Bigl| \sum_k a_{ik} b_{kj} \Bigr|^2$
    $\le \sum_{i,j} \Bigl( \sum_k |a_{ik} b_{kj}| \Bigr)^2$
    $= \sum_{i,j,r,s} |a_{ir} b_{rj} a_{is} b_{sj}|$
    $= \sum_{i,j,r,s} |a_{ir} b_{sj}|\,|a_{is} b_{rj}|$
    $\le \sum_{i,j,r,s} \tfrac{1}{2} \bigl( |a_{ir} b_{sj}|^2 + |a_{is} b_{rj}|^2 \bigr)$
    $= \sum_{i,j,r,s} |a_{ir} b_{sj}|^2$
    $= \sum_{i,r} |a_{ir}|^2 \sum_{s,j} |b_{sj}|^2$
    $= F(A)^2 F(B)^2.$

We note some additional properties of the Frobenius norm. Let $A^*$
denote $\bar{A}'$, the conjugate transpose of $A$. Then

    $F(A)^2 = \operatorname{tr}(AA^*),$    (6.18)

where $\operatorname{tr} A$ denotes the trace of $A$. Let $U$ and $V$ be unitary matrices;
that is, $UU^* = VV^* = I$. Then also

    $F(UAV) = F(A).$    (6.19)

This holds because

    $F(UAV)^2 = \operatorname{tr}(UAVV^*A^*U^*)$
    $= \operatorname{tr}(UAA^*U^*)$
    $= \operatorname{tr}(AA^*)$ (since $AA^*$ and $UAA^*U^*$ are similar)
    $= F(A)^2.$

Using (6.19) and a result of Schur, we can derive a significant in-
equality for the eigenvalues of a matrix $A$. Schur’s result is that an
arbitrary complex matrix $A$ can be transformed into triangular form by
a unitary matrix $U$. Now, for a triangular matrix $T = (t_{ij})$, we have

    $\sum_i |t_{ii}|^2 \le F(T)^2$

with equality if and only if $T$ is diagonal. We obtain, therefore, the
following theorem.

Theorem 6.7. Let $A$ have eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$. Then

    $|\lambda_1|^2 + \cdots + |\lambda_n|^2 \le F(A)^2$    (6.20)

with equality if and only if $A$ is unitarily equivalent to a diagonal matrix.

It is worth noting that a necessary and sufficient condition for $A$ to be
unitarily equivalent to a diagonal matrix is that $A$ be normal, that is,
that $A$ satisfy $AA^* = A^*A$.

Clearly we always have

    $F(A) \le M(A).$    (6.21)

In terms of a norm $N$, we can now say that a sufficient condition for the
process (6.5), or (6.8), to produce the inverse of $A$ is that $N(I - AX_0)$, or
$N(I - AB)$, respectively, be less than 1.

Example. The Frobenius norm of the matrix

    $\begin{bmatrix} 0 & 0 \\ -.1 & .1 \end{bmatrix}$

occurring previously is $.1\sqrt{2}$, so once again we are in a position to infer that
processes (6.5), or (6.8), will converge with $X_0$, or $B$, as chosen.
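
The two norms (6.16) and (6.17) are easily evaluated. The sketch below, with hypothetical function names, computes both for the matrix of this example and confirms that each is less than 1, which is the sufficient condition just stated.

```python
import math


def frobenius(A):
    """F(A): square root of the sum of squared moduli of the entries."""
    return math.sqrt(sum(abs(v) ** 2 for row in A for v in row))


def m_norm(A):
    """M(A) = n * max |a_ij| for an n x n matrix A."""
    return len(A) * max(abs(v) for row in A for v in row)


Y0 = [[0.0, 0.0], [-0.1, 0.1]]
f_value = frobenius(Y0)
m_value = m_norm(Y0)
print(f_value, m_value, f_value < 1 and m_value < 1)
```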

From (6.6), (6.7), and (6.15), we obtain the quantitative estimates for
process (6.5):

    $N(Y_{r+1}) \le N(Y_r)^2,$    (6.22)
    $N(Y_r) \le N(Y_0)^{2^r},$    (6.23)

where $N$ is any norm. Quite often it is better to have directly an
estimate for $N(A^{-1} - X_r)$ than for $N(Y_r)$. Let us suppose that a stage
in the computation has been reached such that

    $N(Y_r) < 1.$

Then we have

    $AX_r = I - Y_r,$
    $A^{-1} = X_r(I - Y_r)^{-1}.$

By Corollary 6.4 and (6.20), we have

    $A^{-1} = X_r(I + Y_r + Y_r^2 + \cdots),$
    $A^{-1} - X_r = X_r(Y_r + Y_r^2 + \cdots);$

and using the properties of a norm we find easily that

    $N(A^{-1} - X_r) \le \frac{N(X_r)\,N(Y_r)}{1 - N(Y_r)}.$    (6.24)

This estimate is useful computationally, since the matrices $X_r$ are
available at each iteration. For the process (6.5), we have as a
consequence of (6.23) and (6.24) that

    $N(A^{-1} - X_r) \le \frac{N(X_r)\,N(Y_0)^{2^r}}{1 - N(Y_0)^{2^r}}.$    (6.25)

The next process we describe, the Gauss-Seidel process, is a practical
iterative scheme for finding the solution of a set of linear equations in
certain circumstances. We let $A = (a_{ij})$, $b = (b_1, b_2, \ldots, b_n)'$, and de-
fine a sequence of vectors

    $x_r = (x_1^{(r)}, x_2^{(r)}, \ldots, x_n^{(r)})', \quad r \ge 0,$

as follows: $x_0$ is arbitrary, and $x_{r+1}$ is obtained from $x_r$ by finding the
solution of the triangular system

    $a_{11}x_1^{(r+1)} + a_{12}x_2^{(r)} + \cdots + a_{1n}x_n^{(r)} = b_1$
    $a_{21}x_1^{(r+1)} + a_{22}x_2^{(r+1)} + \cdots + a_{2n}x_n^{(r)} = b_2$    (6.26)
    $\cdots$
    $a_{n1}x_1^{(r+1)} + a_{n2}x_2^{(r+1)} + \cdots + a_{nn}x_n^{(r+1)} = b_n.$

The matrix equivalent of this process is as follows. Write $A = L + U$,
where $L$ is the lower triangular part of $A$ (including the diagonal) and
where $U$ is the part of $A$ above the main diagonal. Then we have

    $Lx_{r+1} + Ux_r = b.$    (6.27)

We assume now that $L$ is nonsingular, and we put $a = L^{-1}b$,
$C = -L^{-1}U$. Then

    $x_{r+1} = a + Cx_r,$

from which it is easy to show that

    $x_r = (I + C + \cdots + C^{r-1})a + C^r x_0.$    (6.28)


Taking into account Corollary 6.4, we see that the sequence of
vectors $x_r$ converges for any vector $b$ if and only if each eigenvalue of $C$
is of modulus less than 1. For the limit of the sequence (when it
converges) we have, from (6.28),

    $x = \lim_{r \to \infty} x_r = (I - C)^{-1}a = A^{-1}b,$

which is the solution of the system $Ax = b$. We have, then, the following
result.

Theorem 6.8. The Gauss-Seidel process defined by (6.26) produces the
solution of the system $Ax = b$ for arbitrary $b$ if and only if each eigenvalue of
the matrix $C$ defined above is of modulus less than 1.

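Other schemes in terms of different decompositions of $A$ are possible.

Example. Take

    $A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix},$

so that

    $L = \begin{bmatrix} 1 & 0 \\ 1 & 2 \end{bmatrix}, \quad
    U = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}.$

Then

    $C = -L^{-1}U = \begin{bmatrix} 0 & -1 \\ 0 & .5 \end{bmatrix}$

has eigenvalues 0, .5, so the process will succeed for arbitrary $b$. Choosing
$b = (1,1)'$ and $x_0 = 0$, we have the system

    $x_1^{(r+1)} + x_2^{(r)} = 1,$
    $x_1^{(r+1)} + 2x_2^{(r+1)} = 1,$

from which we obtain

    $x_1^{(r)} = 1, \quad x_2^{(r)} = 0, \quad r \ge 1,$

so that $x_r \to (1,0)'$ as $r \to \infty$. This is the solution of the system

    $x_1 + x_2 = 1$
    $x_1 + 2x_2 = 1.$
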
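
The sweep defined by (6.26) can be sketched as follows. This is a minimal illustration with a hypothetical function name; the second starting vector $(0,1)'$ is our own choice, made so that the error does not vanish after one sweep and the factor .5 (the nonzero eigenvalue of $C$) becomes visible.

```python
def gauss_seidel(A, b, x0, sweeps):
    """Perform the given number of sweeps of (6.26), starting from x0."""
    n = len(A)
    x = list(x0)
    for _ in range(sweeps):
        for i in range(n):
            # x[j] already holds the new value for j < i, the old one for j > i
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x


A = [[1.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
print(gauss_seidel(A, b, [0.0, 0.0], 1))   # starting vector of the example
for r in range(1, 5):
    print(r, gauss_seidel(A, b, [0.0, 1.0], r))
```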
The criterion of Theorem 6.8 is rather obscure, and it is of interest 
to look for sufficient conditions which will guarantee the success of the 
process. In this connection we prove the following. 


Theorem 6.9. Let $F$, $G$ be matrices such that $F$ is nonsingular and $F + G$
and $F - G^*$ are hermitian positive definite. Then the eigenvalues of $F^{-1}G$ are all
inside the unit circle.

Proof. Let $\lambda$, $x$ be an eigenvalue and eigenvector, respectively, of
$F^{-1}G$: $F^{-1}Gx = \lambda x$. Then $Gx = \lambda Fx$, $x^*Gx = \lambda x^*Fx$. Adding $x^*Fx$
to both sides of the latter equation, we obtain

    $x^*(F + G)x = (1 + \lambda)x^*Fx.$    (6.29)

Now (6.29) shows that $\lambda \ne -1$, since otherwise $x^*(F + G)x = 0$,
which cannot happen, $F + G$ being positive definite.

Since $F + G$ is hermitian, we obtain from (6.29)

    $(1 + \bar\lambda)x^*F^*x = (1 + \lambda)x^*Fx$
    $= (1 + \lambda)[x^*(F - G^*)x + x^*G^*x]$
    $= (1 + \lambda)[x^*(F - G^*)x + \bar\lambda x^*F^*x],$

    $(1 - |\lambda|^2)x^*F^*x = (1 + \lambda)x^*(F - G^*)x.$

Making use of (6.29) again and the fact that $F + G$ is hermitian, we
have

    $(1 - |\lambda|^2)x^*(F + G)x = |1 + \lambda|^2 x^*(F - G^*)x.$

But both $F + G$ and $F - G^*$ are positive definite, and $\lambda \ne -1$. Thus

    $1 - |\lambda|^2 > 0,$

and the proof is complete.

Suppose now that $A$ is hermitian positive definite. Then $A$ may be
written as $K + D + K^*$, where $K$ is the part of $A$ below the diagonal and
$D$ is the diagonal of $A$. Setting $F = K + D$, $G = K^*$ and noting that $D$
is also hermitian positive definite, we have $A = F + G$, $D = F - G^*$.
Theorem 6.9 applies, and we obtain the following result.

Theorem 6.10. The Gauss-Seidel process will converge if $A$ is hermitian
positive definite.

We now observe that Theorem 6.10 is actually of universal applica-
tion, since the system $Ax = b$ may be replaced by the system $A^*Ax = A^*b$,
and $A^*A$ is hermitian positive definite when $A$ is nonsingular. This
multiplication is of dubious value, however, and is not recommended. See
the Rutishauser example below and Taussky [32].


6.4 Direct Methods 


Among the simplest systems are the triangular ones:

    $a_{11}x_1 = b_1$
    $a_{21}x_1 + a_{22}x_2 = b_2$
    $a_{31}x_1 + a_{32}x_2 + a_{33}x_3 = b_3$    (6.30)
    $\cdots$
    $a_{n1}x_1 + a_{n2}x_2 + a_{n3}x_3 + \cdots + a_{nn}x_n = b_n.$

Provided that $a_{11}a_{22} \cdots a_{nn} \ne 0$, the system (6.30) can be solved recur-
sively as follows:

    $x_1 = b_1/a_{11}$
    $x_2 = (b_2 - a_{21}x_1)/a_{22}$
    $x_3 = (b_3 - a_{31}x_1 - a_{32}x_2)/a_{33}$
    $\cdots$
    $x_n = (b_n - a_{n1}x_1 - a_{n2}x_2 - \cdots - a_{n,n-1}x_{n-1})/a_{nn}.$

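
The recursion above can be written out directly. The sketch below is a minimal illustration with a hypothetical function name and a sample system of our own choosing.

```python
def forward_substitution(A, b):
    """Solve the lower triangular system (6.30); needs a_ii != 0 for all i."""
    x = []
    for i, row in enumerate(A):
        s = sum(row[j] * x[j] for j in range(i))   # terms already determined
        x.append((b[i] - s) / row[i])
    return x


A = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [4.0, 2.0, 4.0]]
b = [2.0, 3.0, 12.0]
x = forward_substitution(A, b)
print(x)
```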


We may therefore consider the solution of a triangular system as com-
pletely settled. There is some advantage, then, in attempting to reduce
an arbitrary system to a triangular system. This can be done in a variety
of ways. Thus if $A$ is an arbitrary matrix, we try to find a lower tri-
angular matrix $L$ and an upper triangular matrix $U$ such that

    $A = LU.$

If $A = (a_{ij})$, $L = (l_{ij})$, $U = (u_{ij})$, this is equivalent to the system of $n^2$
equations in $(n^2 + n)$ unknowns:

    $a_{ij} = \sum_{k=1}^{\min(i,j)} l_{ik}u_{kj}, \quad 1 \le i, j \le n.$

Our latitude lies in the specification of the diagonal coefficients. Let us
regard the diagonal elements of $L$ as known. Then the remaining ele-
ments of $L$ and the elements of $U$ may be determined stepwise in the
order

    $u_{11}, u_{12}, u_{13}, \ldots, u_{1n}$
    $l_{21}, u_{22}, u_{23}, \ldots, u_{2n}$
    $l_{31}, l_{32}, u_{33}, \ldots, u_{3n}$
    $\cdots$

It is important to notice that not every matrix $A$ can be written as
$A = LU$. For example, the $2 \times 2$ matrix $A = (a_{ij})$ where $a_{11} = a_{22} = 0$,
$a_{12} = a_{21} = 1$ cannot be put into this form.

Once $A$ has been expressed in this manner, the system $Ax = b$ becomes
$LUx = b$ so that $x$ is obtainable by solving two triangular systems:

    $x = U^{-1}(L^{-1}b).$
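
A minimal sketch of this decomposition follows, under the assumption that the diagonal elements of $L$ are all taken equal to 1 and that no interchanges are made; the function names are hypothetical. The elements are computed row by row in the order listed above, and a zero pivot $u_{jj}$ that must be divided by stops the process with a `ZeroDivisionError`.

```python
def lu_decompose(A):
    """Factor A = L U with unit diagonal in L (no interchanges)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):                     # l_i1, ..., l_i,i-1
            s = sum(L[i][k] * U[k][j] for k in range(j))
            L[i][j] = (A[i][j] - s) / U[j][j]
        for j in range(i, n):                  # u_ii, ..., u_in
            s = sum(L[i][k] * U[k][j] for k in range(i))
            U[i][j] = A[i][j] - s
    return L, U


def solve_lu(L, U, b):
    """Solve L U x = b by one forward and one backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                         # L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):               # U x = y
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x


A = [[1.0, 1.0], [1.0, 2.0]]
L, U = lu_decompose(A)
x = solve_lu(L, U, [1.0, 1.0])
print(L, U, x)
```

For the $2 \times 2$ matrix with $a_{11} = a_{22} = 0$, $a_{12} = a_{21} = 1$ mentioned above, the very first division fails.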


An important instance not covered by this process occurs when $A$ is
hermitian positive definite. In this case we look for an upper triangular
matrix $T$ such that

    $A = T^*T.$

It is worthwhile remarking that probably the simplest way to determine
whether or not a given hermitian matrix $A$ is positive definite is to try to
express it as $T^*T$; the attempt will succeed if and only if $A$ is positive
definite. In addition, the determinant of $A$ is the square of the
product of the diagonal elements of $T$.

The actual algorithm which produces $T$ is quite simple. If $T = (t_{ij})$,
we have

    $a_{ij} = \sum_{k=1}^{\min(i,j)} \bar{t}_{ki}t_{kj}.$

Thus

    $|t_{11}| = \sqrt{a_{11}}, \quad \ldots, \quad t_{1n} = a_{1n}/\bar{t}_{11}$
    $|t_{22}| = \sqrt{a_{22} - |t_{12}|^2}, \quad \ldots, \quad t_{2n} = (a_{2n} - \bar{t}_{12}t_{1n})/\bar{t}_{22}$
    $\cdots$
    $|t_{nn}| = \sqrt{a_{nn} - |t_{1n}|^2 - \cdots - |t_{n-1,n}|^2}.$

The quantities $|t_{ii}|^2$ are just the quotients of the consecutive principal
minors [that is, $a_{11}$, $(a_{11}a_{22} - |a_{12}|^2)/a_{11}$, etc.], and in practice we would
usually choose $t_{ii}$ to be real and positive. Because of the presence of the
square roots, this is known as the square-root method.
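
Example. The matrix of the system

    $x_1 + x_2 = 1$
    $x_1 + 2x_2 = 1$    (6.31)

is

    $\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix},$

which is symmetric positive definite. Thus we can write

    $\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} =
    \begin{bmatrix} t_{11} & 0 \\ t_{12} & t_{22} \end{bmatrix}
    \begin{bmatrix} t_{11} & t_{12} \\ 0 & t_{22} \end{bmatrix},$

from which $t_{11} = 1$, $t_{12} = 1$, $t_{22} = 1$ in order. Thus

    $\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} =
    \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}
    \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix},$

and the system (6.31) becomes

    $\begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}
    \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
    \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
    \begin{pmatrix} 1 \\ 1 \end{pmatrix},$

so that

    $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
    \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}^{-1}
    \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}^{-1}
    \begin{pmatrix} 1 \\ 1 \end{pmatrix} =
    \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix}
    \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix}
    \begin{pmatrix} 1 \\ 1 \end{pmatrix} =
    \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$

The evident advantage of the square-root method over general de-
composition is that only one triangular system need be solved, since

    $(T^*)^{-1} = (T^{-1})^*.$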
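
A minimal sketch of the square-root method follows, restricted (as an assumption of this illustration) to real symmetric matrices, with the diagonal elements of $T$ taken positive. The function name is hypothetical. A nonpositive quantity under a square root signals that the matrix is not positive definite.

```python
import math


def square_root_method(A):
    """Return upper triangular T with A = T'T for real symmetric A.

    Raises ValueError when A is not positive definite.
    """
    n = len(A)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        d = A[i][i] - sum(T[k][i] ** 2 for k in range(i))
        if d <= 0:
            raise ValueError("matrix is not positive definite")
        T[i][i] = math.sqrt(d)
        for j in range(i + 1, n):
            s = sum(T[k][i] * T[k][j] for k in range(i))
            T[i][j] = (A[i][j] - s) / T[i][i]
    return T


T = square_root_method([[1.0, 1.0], [1.0, 2.0]])
det = math.prod(T[i][i] for i in range(2)) ** 2   # determinant of A
print(T, det)
```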


The next method we consider, the elimination method, is by far the
best general method available for solving a system or inverting a matrix.
Let $A$ be an arbitrary $n \times n$ matrix, $B$ an arbitrary $n \times k$ matrix. The
method consists of performing elementary transformations (adding a
multiple of a row to some other row, interchanging two rows, multiply-
ing a row by a constant) on the $n \times (n + k)$ matrix $[A,B]$ until $A$ has been
reduced to the identity, when $B$ will have become $A^{-1}B$. Some care
must be exercised here, and it is a good idea to choose the so-called
pivotal elements to be the largest columnwise.

Example. Consider the following system:

    $x_1 + 2x_2 + 3x_3 = 1$
    $3x_1 + x_2 + x_3 = 0$    (6.32)
    $2x_1 + x_2 + x_3 = 0.$

The steps involved in finding the solution are as follows:

    1     2     3   |   1     Original matrix.
    3     1     1   |   0     First pivot: 3.
    2     1     1   |   0

    3     1     1   |   0     Rows 1 and 2 interchanged.
    1     2     3   |   1
    2     1     1   |   0

    1    1/3   1/3  |   0     Row 1 divided by first pivot.
    0    5/3   8/3  |   1     Rows 2 and 3 modified by row 1.
    0    1/3   1/3  |   0     Second pivot: 5/3.

    1     0   -1/5  | -1/5    Rows 1 and 3 modified by row 2.
    0     1    8/5  |  3/5    Row 2 divided by second pivot.
    0     0   -1/5  | -1/5    Third pivot: -1/5.

    1     0     0   |   0     Rows 1 and 2 modified by row 3.
    0     1     0   |  -1
    0     0     1   |   1     Row 3 divided by third pivot.

The solution is thus

    $x_1 = 0, \quad x_2 = -1, \quad x_3 = 1.$
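
The elimination steps can be sketched as follows. This is a minimal illustration with a hypothetical function name; exact rational arithmetic from the standard `fractions` module is used so that the intermediate rows agree with the hand computation, and the pivot in each column is the entry of largest absolute value among the rows not yet used.

```python
from fractions import Fraction


def gauss_jordan(A, B):
    """Reduce [A, B] until A is the identity; return the transformed B."""
    n = len(A)
    M = [[Fraction(v) for v in ra + rb] for ra, rb in zip(A, B)]
    for col in range(n):
        # choose the largest entry in this column among unused rows
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[p][col] == 0:
            raise ValueError("matrix is singular")
        M[col], M[p] = M[p], M[col]            # interchange rows
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]   # divide by the pivot
        for r in range(n):
            if r != col:                       # clear the rest of the column
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]


A = [[1, 2, 3], [3, 1, 1], [2, 1, 1]]
B = [[1], [0], [0]]
solution = gauss_jordan(A, B)
print(solution)
```

Taking $B$ to be the identity matrix in the same routine yields $A^{-1}$.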


A word about checking. Let 6, be the & x 1 vector (1,1,...,1)’. 
Then Ad, + Bod, consists just of the row sums of [.4,B] and so is easily 
computed. If the previous process is applied to then xn+k + 1 

_ Matrix 


C = [A, B, —A6, — BO], 


. then each row sum of C is initially 0, and this fact will remain true 

' throughout the computation. Thus, if at any stage some row sum 

| differs from 0 significantly, an error has been introduced. At the end of 
the computation, the largest row sum produced is a good measure of the 
error in the solution A-!B. 
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Example (with checking). For the system 
7x; — xX, = 3 
3x, + 5x, = 2 
the solution is as follows: 
7 —-1 3 a Original matrix. 
E 5 2 —10 
| KW % _% oa : enacts first pivot. 
i 384 54 st ow mene by row I. 
=e Second pivot. 
: 0 1%» wd Row 2 divided by second pivot. 


0 1 %s —*%s 
Thus x, = 1748, x, = %s, and the row sums are both 0. 
It is worth noting that the product of the pivotal elements is the determinant
of $A$, apart from a change of sign for each interchange of rows.
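The elimination with a check column described above can be sketched in a few lines. This is only an illustrative sketch, not the routine used by the authors: the function name `gauss_jordan_with_check` and the use of exact rational arithmetic (so that the row sums stay exactly zero) are assumptions made here for clarity.

```python
from fractions import Fraction


def gauss_jordan_with_check(a, b):
    """Reduce [A, B, check] until A becomes I; return (X, pivots, worst_row_sum).

    a is an n x n list of rows and b an n x k list of rows.  The extra
    check column holds minus the row sum, so every row of the working
    matrix should keep summing to zero throughout the reduction.
    """
    n, k = len(a), len(b[0])
    rows = []
    for row_a, row_b in zip(a, b):
        row = [Fraction(v) for v in list(row_a) + list(row_b)]
        row.append(-sum(row))            # check column
        rows.append(row)

    pivots = []
    for col in range(n):
        # Pivot: element of largest modulus in this column among unused rows.
        p = max(range(col, n), key=lambda r: abs(rows[r][col]))
        if rows[p][col] == 0:
            raise ValueError("matrix is singular")
        rows[col], rows[p] = rows[p], rows[col]
        pivot = rows[col][col]
        pivots.append(pivot)
        rows[col] = [v / pivot for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]

    worst = max(abs(sum(row)) for row in rows)
    solution = [row[n:n + k] for row in rows]
    return solution, pivots, worst


x, pivots, worst = gauss_jordan_with_check([[7, -1], [3, 5]], [[3], [2]])
print(x, pivots, worst)
```

With floating-point numbers in place of `Fraction`, the returned `worst` value would play the role of the error measure mentioned in the text.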
If we are interested primarily in matrix inversion, rather than in the
solution of equations, there are several possible approaches. We can
solve the $n$ systems

\[ Ax_i = \delta_i, \qquad 1 \le i \le n, \]

where the $\delta_i$ are the $n$ unit vectors. The inverse of $A$ is then
$[x_1, x_2, \ldots, x_n]$. Or we can do this at one blow by choosing $B$ as the
identity matrix in the previous discussion. Quite often it is desirable to avoid
carrying the identity matrix (to save space) or to solve $n$ systems (to save time),
and the following scheme is especially suitable for these purposes and is
easily programmed for high-speed computers.

Disregarding check vectors, our matrix initially is the $n \times (n + 1)$ matrix
$[A, 0]$. At step $i$, $1 \le i \le n$, the first column of the matrix is scanned
to determine the element of largest absolute value which does not lie in a row
already used for a pivotal element in steps $1, 2, \ldots, i - 1$. When
this element has been determined, the number of the row in which it
occurs, say $k_i$, is recorded in a "permutation store." The element so
determined is 0 if and only if $A$ is singular. If the element is not 0, the
$k_i$th unit vector is put into the $(n + 1)$st column of the matrix. The
$k_i$th row of this matrix is now divided by the pivotal element, and
appropriate multiples of the $k_i$th row are added to the remaining rows to
make all first-column elements, other than the $k_i$th element, 0. The
entire matrix is now shifted left one column and the procedure repeated
until $n$ such steps have been accomplished. At this point, $A$ has been
replaced by its permuted inverse, and this is unscrambled as follows by
means of the permutation

\[ \begin{pmatrix} 1 & 2 & \cdots & n \\ k_1 & k_2 & \cdots & k_n \end{pmatrix}. \]

Rearrange the columns so that column 1 becomes column $k_1$, column 2
becomes column $k_2$, ..., column $n$ becomes column $k_n$. Now rearrange
the rows so that row $k_1$ becomes row 1, row $k_2$ becomes row 2, ...,
row $k_n$ becomes row $n$.

In practice, the unscrambling would be part of the print routine. 

The justification for this process is quite simple. Let $P$ be the permutation
matrix which has 1 in the $k_i$th row and $i$th column. Then the
process determines a matrix $M$ such that

\[ M[A, P] = [MA, MP] = [P, MP]. \]

Hence

\[ MA = P, \qquad MP = PA^{-1}P, \qquad A^{-1} = P'(MP)P'. \]
Example (without checking). We choose

\[ A = \begin{pmatrix} 1 & 10 & 1 \\ 2 & 0 & 1 \\ 3 & 3 & 2 \end{pmatrix}. \]

The computation is as follows (the last column shows the permutation store):

      1      10       1       0      3     Original matrix.
      2       0       1       0            First pivot (row 3).
      3       3       2       1            Third unit vector in last column.

      0       9      1/3    -1/3     3     Step 1 performed.
      0      -2     -1/3    -2/3     1     Second pivot (row 1).
      1       1      2/3     1/3

      9      1/3    -1/3      1      3     First column deleted.
     -2     -1/3    -2/3      0      1     First unit vector in last column.
      1      2/3     1/3      0

      1      1/27   -1/27    1/9     3     Step 2 performed.
      0     -7/27  -20/27    2/9     1     Third pivot (row 2).
      0     17/27   10/27   -1/9     2

     1/27   -1/27    1/9      0      3     First column deleted.
    -7/27  -20/27    2/9      1      1     Second unit vector in last column.
    17/27   10/27   -1/9      0      2

      0     -1/7     1/7     1/7     3     Step 3 performed.
      1     20/7    -6/7   -27/7     1
      0    -10/7     3/7    17/7     2

The matrix

\[ \begin{pmatrix} -1/7 & 1/7 & 1/7 \\ 20/7 & -6/7 & -27/7 \\ -10/7 & 3/7 & 17/7 \end{pmatrix} \]

must now be unscrambled according to the permutation

\[ \begin{pmatrix} 1 & 2 & 3 \\ 3 & 1 & 2 \end{pmatrix} \]

as follows. Column 1 becomes column 3, column 2 becomes column 1,
column 3 becomes column 2, to give

\[ \begin{pmatrix} 1/7 & 1/7 & -1/7 \\ -6/7 & -27/7 & 20/7 \\ 3/7 & 17/7 & -10/7 \end{pmatrix}. \]

Now row 3 becomes row 1, row 1 becomes row 2, row 2 becomes row 3, to give

\[ \begin{pmatrix} 3/7 & 17/7 & -10/7 \\ 1/7 & 1/7 & -1/7 \\ -6/7 & -27/7 & 20/7 \end{pmatrix}. \]

This is the inverse of the original matrix, as may be verified by direct
multiplication.
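The scheme with a permutation store can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' program: the name `invert_with_permutation_store` is hypothetical, exact fractions are used so that the result can be compared with the worked example, and the unit vector is appended as a new last column (which is equivalent to overwriting the zero column of $[A, 0]$).

```python
from fractions import Fraction


def invert_with_permutation_store(a):
    """Invert a square matrix while storing only n + 1 columns at a time.

    Returns (inverse, store), where store lists k_1, ..., k_n (1-based).
    """
    n = len(a)
    m = [[Fraction(v) for v in row] for row in a]
    store = []                                # the "permutation store" (0-based)
    for _ in range(n):
        free = [r for r in range(n) if r not in store]
        k = max(free, key=lambda r: abs(m[r][0]))
        if m[k][0] == 0:
            raise ValueError("matrix is singular")
        store.append(k)
        for r in range(n):                    # k-th unit vector into last column
            m[r].append(Fraction(1 if r == k else 0))
        pivot = m[k][0]
        m[k] = [v / pivot for v in m[k]]
        for r in range(n):
            if r != k:
                factor = m[r][0]
                m[r] = [v - factor * w for v, w in zip(m[r], m[k])]
        m = [row[1:] for row in m]            # shift left one column

    # Unscramble: column i becomes column k_i, then row k_i becomes row i.
    by_cols = [[None] * n for _ in range(n)]
    for r in range(n):
        for i, k in enumerate(store):
            by_cols[r][k] = m[r][i]
    inverse = [by_cols[k] for k in store]
    return inverse, [k + 1 for k in store]


inverse, store = invert_with_permutation_store([[1, 10, 1], [2, 0, 1], [3, 3, 2]])
print(store)
print(inverse)
```

Applied to the matrix of the example, the store comes out as 3, 1, 2, in agreement with the permutation used in the text.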

We note that one of the more efficient ways of evaluating determinants
is by the triangularization of the matrix. This can be done at the expense
of about $\tfrac{1}{3}n^3$ multiplications. Since the value of the determinant
gives some information about the "condition" of a matrix or system, it
is desirable that programs for matrix inversion or the solution of equations
include computation of the determinant, which can be secured with
negligible expense.

Another method which is sometimes useful is inversion by partitioning.
Let us suppose that $A$ is a nonsingular $n \times n$ matrix and write

\[ A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \]

where $a$ is an $r \times r$ matrix, $b$ an $r \times s$ matrix, $c$ an $s \times r$ matrix, and $d$ an
$s \times s$ matrix. Here $r$ and $s$ are positive integers with sum $n$. We
assume $A^{-1}$ partitioned similarly:

\[ A^{-1} = \begin{pmatrix} \alpha & \beta \\ \gamma & \delta \end{pmatrix}. \]

Then the relationship $AA^{-1} = I$ leads to the following system of equations
for the unknown matrices $\alpha$, $\beta$, $\gamma$, $\delta$:

\[
\begin{aligned}
a\alpha + b\gamma &= I^{(r)}, & a\beta + b\delta &= 0^{(r,s)}, \\
c\alpha + d\gamma &= 0^{(s,r)}, & c\beta + d\delta &= I^{(s)},
\end{aligned}
\]

from which we can deduce in general that

\[
\begin{aligned}
\alpha &= a^{-1} + a^{-1}b\,\delta\,c\,a^{-1}, & \beta &= -a^{-1}b\,\delta, \\
\gamma &= -\delta\,c\,a^{-1}, & \delta &= (d - ca^{-1}b)^{-1}.
\end{aligned}
\]

This of course necessitates that both $a$ and $d - ca^{-1}b$ be nonsingular.
Partitioning does not save time, but it makes possible the inversion
of matrices too large for storage in the "fast" memory of a high-speed
computer.
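The four block formulas can be exercised with a short recursive sketch. The choice of splitting at $r = \lfloor n/2 \rfloor$ and recursing on the blocks, as well as the helper names, are assumptions made here for illustration; the sketch fails (as the text warns) whenever a leading block $a$ or a matrix $d - ca^{-1}b$ met along the way is singular.

```python
from fractions import Fraction


def matmul(p, q):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*q)] for row in p]


def matsub(p, q):
    return [[x - y for x, y in zip(rp, rq)] for rp, rq in zip(p, q)]


def negate(p):
    return [[-x for x in row] for row in p]


def invert_by_partitioning(m):
    """Invert m from the blocks a, b, c, d using the formulas for alpha..delta."""
    n = len(m)
    if n == 1:
        if m[0][0] == 0:
            raise ZeroDivisionError("a leading block is singular")
        return [[1 / Fraction(m[0][0])]]
    r = n // 2
    a = [row[:r] for row in m[:r]]
    b = [row[r:] for row in m[:r]]
    c = [row[:r] for row in m[r:]]
    d = [row[r:] for row in m[r:]]
    a_inv = invert_by_partitioning(a)
    # delta = (d - c a^{-1} b)^{-1}
    delta = invert_by_partitioning(matsub(d, matmul(matmul(c, a_inv), b)))
    beta = negate(matmul(matmul(a_inv, b), delta))
    gamma = negate(matmul(matmul(delta, c), a_inv))
    # alpha = a^{-1} + a^{-1} b delta c a^{-1} = a^{-1} - a^{-1} b gamma
    alpha = matsub(a_inv, matmul(matmul(a_inv, b), gamma))
    return ([ra + rb for ra, rb in zip(alpha, beta)]
            + [rc + rd for rc, rd in zip(gamma, delta)])


wilson = [[10, 7, 8, 7], [7, 5, 6, 5], [8, 6, 10, 9], [7, 5, 9, 10]]
print(invert_by_partitioning(wilson))
```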
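We note here that it is possible to invert matrices with complex elements
by real operations only. If

\[ A = X + iY, \qquad X \text{ and } Y \text{ real}, \]

then the inverse of

\[ \begin{pmatrix} X & Y \\ -Y & X \end{pmatrix} \]

is of the same form, and if

\[ \begin{pmatrix} X & Y \\ -Y & X \end{pmatrix}^{-1} = \begin{pmatrix} Z & W \\ -W & Z \end{pmatrix}, \]

then $A^{-1} = Z + iW$.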
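The last statement can be checked by multiplying out the blocks. The short derivation below is a verification sketch added for the reader, using only the symbols $X$, $Y$, $Z$, $W$ of the text:

```latex
\[
\begin{pmatrix} X & Y \\ -Y & X \end{pmatrix}
\begin{pmatrix} Z & W \\ -W & Z \end{pmatrix}
=
\begin{pmatrix} XZ - YW & XW + YZ \\ -(XW + YZ) & XZ - YW \end{pmatrix}
=
\begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix},
\]
so that $XZ - YW = I$ and $XW + YZ = 0$, and therefore
\[
(X + iY)(Z + iW) = (XZ - YW) + i\,(XW + YZ) = I .
\]
```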
6.5 Evaluation of Methods 


In the examples discussed, the matrices whose inverses were computed 
had integral elements and were of low order, the computations being 
carried through exactly. In practice this is not possible, and computa- 
tions must be carried out to a fixed number of places, thus introducing 
round-off errors. These are more critical in the direct methods and are 
serious enough to prevent completion of the algorithm and often to in- 
validate the results obtained. It is important to have some numerical 
estimate for this error in terms of the matrix. Such an estimate for a 
variant of the elimination method above has been given by von Neumann 
and Goldstine [15]. In their method, fixed-point arithmetic with scal- 
ing is used, and the matrix A to be inverted is always real and symmetric. 
The inverse produced is forced to be symmetric by identifying upper and 
lower triangular parts. For another treatment see Wilkinson [36]. 


We define the P-condition number of an arbitrary nonsingular matrix
$A$ by

\[ P(A) = \frac{|\lambda|}{|\mu|}, \]

where $\lambda$ is a root of largest modulus of $A$ and $\mu$ a root of least modulus
of $A$. $A$ is assumed real, symmetric, and positive definite. Then, if the
algorithm of von Neumann and Goldstine actually produces an inverse
$X$, it is shown that

\[ |\lambda_0| < 14.24\,P(A)\,n^2\epsilon, \]

where $\lambda_0$ is a root of largest modulus of $I - AX$, $n$ is the order of the matrix
$A$, and $\epsilon$ is the smallest number recognized by the algorithm (which in
practice might be a small negative power of 2 or of 10). If, however, the algorithm
fails to produce an inverse, then it is shown that $A$ is nearly singular, and

\[ |\mu| < 10\,n^2\epsilon. \]


In the case when $A$ is not positive definite, the algorithm is applied
instead to $AA'$, and

\[ A^{-1} = A'(AA')^{-1} \]

is used to show that, if the algorithm produces an inverse $X$, then

\[ |\lambda_0| < 36.58\,P(AA')\,n^2\epsilon. \]

It is known that $P(AA') \ge P(A)^2$. Thus matrices $A$ for which $P(A)$ is
large can be expected to be troublesome, and it is reasonable to use this
number as a measure of the "ill condition" of $A$, even when $A$ is not of
the class considered by von Neumann and Goldstine.

We consider a significant example suggested by Rutishauser. We put

\[
B = \begin{pmatrix}
1 \\
1 & -1 \\
1 & -2 & 1 \\
1 & -3 & 3 & -1 \\
1 & -4 & 6 & -4 & 1 \\
\vdots & & & & & \ddots
\end{pmatrix}.
\]

Then the eigenvalues of $B$ are either 1 or $-1$, so that $P(B) = 1$. We
put

\[ A = BB' = \left( \binom{i + j}{i} \right), \qquad 0 \le i, j \le n - 1. \]

Then $A$ is symmetric positive definite. We note that $B^2 = I$, so that
$A^{-1} = B'B$. Thus

\[ A^{-1} = BAB, \]

which implies that the eigenvalues of $A$ and the eigenvalues of $A^{-1}$
coincide. From this we conclude that $P(A)$ is just the square of the largest
eigenvalue of $A$. Now, if $\lambda$ is the largest eigenvalue of $A$, we have

\[ \lambda < \operatorname{tr} A < n\lambda. \]

It is easy to show that positive constants $\alpha$, $\beta$ exist such that

\[ \alpha \cdot \frac{4^n}{n} < \operatorname{tr} A < \beta \cdot 4^n, \]

and this allows the estimate

\[ \log \lambda \sim n \log 4, \]

implying that

\[ \log P(A) \sim 4n \log 2. \]

Thus $A$ is exponentially ill-conditioned.
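The algebraic facts used in this example are easy to confirm for a small order. The sketch below assumes that the rows of $B$ are the signed binomial coefficients $(-1)^j\binom{i}{j}$, which is how the displayed rows read; the helper names and the choice $n = 6$ are illustrative only.

```python
from math import comb


def rutishauser_b(n):
    """Lower triangular matrix with entries (-1)**j * C(i, j), 0 <= i, j < n."""
    return [[(-1) ** j * comb(i, j) for j in range(n)] for i in range(n)]


def matmul(p, q):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*q)] for row in p]


def transpose(m):
    return [list(col) for col in zip(*m)]


n = 6
B = rutishauser_b(n)
A = matmul(B, transpose(B))
identity = [[int(i == j) for j in range(n)] for i in range(n)]

print(matmul(B, B) == identity)                          # B^2 = I
print(matmul(A, matmul(transpose(B), B)) == identity)    # A^{-1} = B'B
print(sum(A[i][i] for i in range(n)), 4 ** n)            # tr A against 4^n
```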
Another difficult example is the Hilbert matrix

\[ H = \left( (i + j + 1)^{-1} \right), \qquad 0 \le i, j \le n - 1. \]

It can be shown that $H$ is exponentially ill-conditioned. The results
of attempts to invert $H$ for small $n$ are given by Todd [31].

The question of evaluating methods for inverting matrices, in particu- 
lar the Gaussian method, has been discussed by Newman and Todd [30]. 
They give a set of test matrices and include representative numerical 
results from various machines. 

The time involved in the two processes discussed can be estimated
roughly in terms of the multiplications they require, working with
fixed-point arithmetic. With floating-point arithmetic a more thorough
discussion is required, since the time spent in performing the additions may
not be negligible. The Gauss-Seidel process requires about $n^2$
multiplications per iteration for a "full" matrix; it is difficult to make any general
statement about the number of iterations required, except in the case in
which the matrix has a dominant diagonal.

The solution of a system of equations by the Gauss process requires
about $\tfrac{1}{3}n^3$ multiplications. The inversion of a matrix by this process
requires about $n^3$ multiplications: $\tfrac{1}{3}n^3$ for the triangulation, $\tfrac{1}{6}n^3$ to
handle the right-hand sides, and $\tfrac{1}{2}n^3$ for the back substitutions. If we
used the Jordan variant of the process in which we diagonalize our
matrix completely, so that no back substitution is required, we would
still need $n^3$ multiplications: $\tfrac{1}{2}n^3$ for the diagonalization and $\tfrac{1}{2}n^3$ for the
manipulation of the right-hand side.

We conclude this section by discussing the solution of the system
(due to T. S. Wilson)

\[
\begin{aligned}
10x_1 + 7x_2 + 8x_3 + 7x_4 &= 32 \\
 7x_1 + 5x_2 + 6x_3 + 5x_4 &= 23 \\
 8x_1 + 6x_2 + 10x_3 + 9x_4 &= 33 \\
 7x_1 + 5x_2 + 9x_3 + 10x_4 &= 31.
\end{aligned}
\tag{6.33}
\]

If

\[
A = \begin{pmatrix}
10 & 7 & 8 & 7 \\
7 & 5 & 6 & 5 \\
8 & 6 & 10 & 9 \\
7 & 5 & 9 & 10
\end{pmatrix}
\]

is the matrix of the system, then $A$ is nonsingular and

\[
A^{-1} = \begin{pmatrix}
25 & -41 & 10 & -6 \\
-41 & 68 & -17 & 10 \\
10 & -17 & 5 & -3 \\
-6 & 10 & -3 & 2
\end{pmatrix}.
\tag{6.34}
\]

The solution of the system (6.33) is thus $x_1 = x_2 = x_3 = x_4 = 1$. If
we ask for a vector

\[ x(\epsilon) = [x_1(\epsilon), x_2(\epsilon), x_3(\epsilon), x_4(\epsilon)]' \]

satisfying

\[ Ax(\epsilon) = (32 + \epsilon,\; 23 - \epsilon,\; 33 + \epsilon,\; 31 - \epsilon)', \]

then it is easily determined from (6.34) that

\[ x(\epsilon) = (1 + 82\epsilon,\; 1 - 136\epsilon,\; 1 + 35\epsilon,\; 1 - 21\epsilon)'. \]

Thus, for example, the vector

\[ (9.2,\; -12.6,\; 4.5,\; -1.1)' \]

satisfies the system (6.33) with an error of $\pm .1$ in each equation, but in
no sense can it be considered an approximation to the true solution
$(1, 1, 1, 1)'$.

This indicates that, if the solution of (6.33) is computed with no more
than 2 or 3 figures retained after the decimal point, trouble is likely to
arise; and this indeed is the case. The matrix $A$ is ill-conditioned, with
condition number

\[ P(A) \approx 3000. \]
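The sensitivity just described can be reproduced exactly from the inverse (6.34). The following is a small sketch; the names `matvec` and `perturbed_solution` are introduced here for illustration, and rational arithmetic is used so that no rounding enters.

```python
from fractions import Fraction

A = [[10, 7, 8, 7], [7, 5, 6, 5], [8, 6, 10, 9], [7, 5, 9, 10]]
A_INV = [[25, -41, 10, -6], [-41, 68, -17, 10], [10, -17, 5, -3], [-6, 10, -3, 2]]


def matvec(m, v):
    return [sum(Fraction(a) * b for a, b in zip(row, v)) for row in m]


def perturbed_solution(eps):
    """Solve A x = (32 + eps, 23 - eps, 33 + eps, 31 - eps)' using (6.34)."""
    eps = Fraction(eps)
    rhs = [32 + eps, 23 - eps, 33 + eps, 31 - eps]
    return matvec(A_INV, rhs)


x = perturbed_solution("0.1")
print([float(v) for v in x])              # the vector quoted in the text
print([float(v) for v in matvec(A, x)])   # right-hand sides, each off by 0.1
```

A change of one part in a few hundred in the right-hand sides thus moves the solution by an order of magnitude.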
It is important to note that the solution of this system by the
Gauss-Seidel method is awkward. Starting with the zero vector, we obtain
the sequence

    (3.20,  .12,  .67, .20)',
    (2.44,  .18, 1.06, .35)',
    (1.98,  .21, 1.28, .46)',
    (1.55,  .21, 1.44, .61)',
    ...

This slow convergence should be compared with that obtained from
such a system as

\[
\begin{aligned}
-2x_1 + x_2 &= -1 \\
x_1 - 2x_2 + x_3 &= 0 \\
x_2 - 2x_3 + x_4 &= 0 \\
x_3 - 2x_4 &= -1.
\end{aligned}
\]
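The comparison between the two systems can be tried out with a few lines of code. This is a minimal sketch of the Gauss-Seidel sweep, assuming the zero starting vector used in the text; the function name `gauss_seidel` and the choice of 50 sweeps are illustrative.

```python
from fractions import Fraction


def gauss_seidel(a, b, sweeps):
    """Return the list of iterates produced by `sweeps` Gauss-Seidel sweeps."""
    n = len(a)
    x = [Fraction(0)] * n
    history = []
    for _ in range(sweeps):
        for i in range(n):
            s = sum(Fraction(a[i][j]) * x[j] for j in range(n) if j != i)
            x[i] = (Fraction(b[i]) - s) / a[i][i]   # newest values used at once
        history.append(list(x))
    return history


wilson = [[10, 7, 8, 7], [7, 5, 6, 5], [8, 6, 10, 9], [7, 5, 9, 10]]
second_difference = [[-2, 1, 0, 0], [1, -2, 1, 0], [0, 1, -2, 1], [0, 0, 1, -2]]

slow = gauss_seidel(wilson, [32, 23, 33, 31], 50)
fast = gauss_seidel(second_difference, [-1, 0, 0, -1], 50)

for iterate in slow[:2]:
    print([round(float(v), 2) for v in iterate])
print(max(abs(float(v) - 1) for v in slow[-1]))   # error after 50 sweeps
print(max(abs(float(v) - 1) for v in fast[-1]))   # error after 50 sweeps
```

Both systems have the solution $(1, 1, 1, 1)'$, so the two printed errors give a direct comparison of the rates of convergence.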


l1+n—n? n? 
2 ioe Tis B= 
1 | -—- n> —n —n? 
n2 
Here F(AB — I) = 2 ; F(BA — I) = 2n. 
n 


THE DETERMINATION OF EIGENVALUES AND 
EIGENVECTORS 


6.6 Introduction 


In this context a direct method might be the determination of the 
characteristic polynomial, the solution of the characteristic equation, 
followed by the solution of the (singular) linear systems. Elementary 
theoretical considerations and the study of simple examples suggest that 
this method is not likely to be very satisfactory (see, e.g., Wilkinson [34], 
Goldstine, Murray, and von Neumann [17]). 

There are two essentially different problems, according to whether we
require the dominant (or a few dominant) characteristic values or all the
characteristic values. The solution to the first problem does not imply,
in practice, a solution to the second. For, in general, some loss of
accuracy is incurred in determining the dominant root; and when the
process of "deflation" is applied (the removal of the dominant root so
that another one becomes dominant), there is a further loss of accuracy.
Repetition of this process involves "successive contamination," so that
results without significance may be obtained.

We discuss first the Jacobi method of rotations, which was shown to be 
a practicable one by Goldstine, Murray, and von Neumann [17] and by 
Givens [16]. This method was first shown to be suitable for real sym- 
metric matrices and later for normal matrices [25]. The situation in 
more general cases is still rather obscure. 

There have been valuable studies of variants and developments of the 
Jacobi method by Forsythe and Henrici [19], Pope and Tompkins [20], 
Causey and Henrici [21], and Henrici [35], among others. The paper 
by Pope and Tompkins contains the results of various experimental 
computations. 

A standard method of finding the dominant characteristic root and 
associated characteristic vector is the power method. This method 
applies to any matrix A with a dominant eigenvalue and converges if the 
initial vector has a nonzero component of the dominant eigenvector. 
An important case in which A has a dominant eigenvalue is that in 
which A is a positive matrix, that is, the case in which all the elements of 
A are positive. This follows from the Perron-Frobenius theory (see, 
e.g., [33]). This theory and its extensions cover many important prac- 
tical cases (see, e.g., Varga [26]). 

Another method for finding the characteristic roots of real symmetric 
matrices is the Rayleigh quotient method. This has been investigated 
thoroughly by Ostrowski [24], who has shown that it is applicable in 
general cases. 

For a survey of methods for the solution of this problem, see White [22]. 


6.7 Rotations 


The properties of the orthogonal group are quite important in the 
determination of eigenvalues and eigenvectors, and we develop some of 
these properties here. 

The orthogonal group of order $n$, denoted by $O_n$, is the multiplicative
group of real $n \times n$ matrices $R$ such that

\[ RR' = I. \]

The elements of $O_n$ have determinant $\pm 1$. The subgroup of $O_n$
consisting of matrices $R$ with determinant 1 will be denoted by $O_n^+$ and the
complex of matrices $R$ of $O_n$ with determinant $-1$ by $O_n^-$. The
elements of $O_n$ are referred to as orthogonal matrices, and sometimes as
rotation matrices. Evidently $O_n^+$ is of index 2 in $O_n$ and

\[ O_n = O_n^+ + KO_n^+ = O_n^+ + O_n^-, \]

where $K = (-1) \dotplus I^{(n-1)}$.
A well-known subgroup of $O_n$ is the group $P_n$ of permutation matrices.
A permutation matrix is one in which each row and column contains
just one nonzero entry, which is a 1.

Suppose $1 \le \alpha < \beta \le n$. We define the matrices $R_{\alpha\beta} = R_{\alpha\beta}(c, s)$
as follows: $R_{\alpha\beta}$ has $-s$ in the $(\alpha,\beta)$ position, $s$ in the $(\beta,\alpha)$ position,
$c$ in the $(\alpha,\alpha)$ and $(\beta,\beta)$ positions, 1 elsewhere on the diagonal, and 0 in
every other position. Thus, for some suitable permutation matrix $P$,

\[ R_{\alpha\beta} = P \left( \begin{pmatrix} c & -s \\ s & c \end{pmatrix} \dotplus I^{(n-2)} \right) P'. \]

We shall always require that $c$, $s$ be real and

\[ c^2 + s^2 = 1. \]

Then the matrices $R_{\alpha\beta}$ are in $O_n^+$. We shall prove the following
theorem.

Theorem 6.11. $O_n^+$ is generated by the matrices $R_{\alpha\beta}$.

Proof. Let $R = (r_{ij})$ be in $O_n^+$. Suppose that $r_{13} = \cdots = r_{1,k-1} = 0$,
$r_{1k} \ne 0$, for some $k$ satisfying $3 \le k \le n$. Then the first row of $RR_{2k}$ is

\[ (r_{11},\; cr_{12} + sr_{1k},\; 0,\; 0,\; \ldots,\; -sr_{12} + cr_{1k},\; \ldots). \]

Since $r_{1k} \ne 0$, we can choose

\[ c = r_{12}(r_{12}^2 + r_{1k}^2)^{-1/2}, \qquad s = r_{1k}(r_{12}^2 + r_{1k}^2)^{-1/2}. \]

Then $c^2 + s^2 = 1$ and $-sr_{12} + cr_{1k} = 0$. Thus we have shown that for
suitable matrices $R_{2r}$ with $3 \le r \le n$, the first row of

\[ W = RR_{23}R_{24} \cdots R_{2n} \]

is $(w_{11}, w_{12}, 0, 0, \ldots, 0)$.
Since $W$ is orthogonal, we have $w_{11}^2 + w_{12}^2 = 1$. Now the first row of
$WR_{12}$ is

\[ (cw_{11} + sw_{12},\; -sw_{11} + cw_{12},\; 0,\; 0,\; \ldots,\; 0); \]

and choosing $c = w_{11}$, $s = w_{12}$, we find that this becomes
$(1, 0, 0, \ldots, 0)$.

Thus

\[ RR_{23}R_{24} \cdots R_{2n}R_{12} = \begin{pmatrix} 1 & 0 \\ v & R_1 \end{pmatrix}. \]

But this matrix belongs to $O_n^+$. This implies that $v = 0$ and that $R_1$,
which is of order $n - 1$, is also orthogonal. The conclusion now follows
by induction, since the theorem is certainly true for $n = 2$.
6.8 Givens' Method for Real Symmetric Matrices


The procedure used in reducing the matrix $R$ in the proof of Theorem
6.11 has the following important application. Let $A$ be a real symmetric
matrix. Then by transformations of the type $R'_{\alpha\beta}AR_{\alpha\beta}$ we can
make the

\[ (1,3),\; (1,4),\; \ldots,\; (1,n);\quad (2,4),\; \ldots,\; (2,n);\quad \ldots;\quad (n-2,n) \]

elements of $A$ zero in this order by appropriate rotations

\[ R_{23},\; R_{24},\; \ldots,\; R_{2n};\quad R_{34},\; R_{35},\; \ldots,\; R_{3n};\quad \ldots;\quad R_{n-1,n}. \]

Set

\[ B = R'_{n-1,n} \cdots R'_{24}R'_{23}\,A\,R_{23}R_{24} \cdots R_{n-1,n}. \]

Then $B$ can have nonzero elements only on the principal diagonal and
on the two diagonals immediately above and below the principal diagonal
and is referred to as a triple-diagonal matrix. But $A$ and $B$ are
similar, and so have the same eigenvalues. Thus the problem of determining
the eigenvalues of a real symmetric matrix is entirely equivalent to that of
determining the eigenvalues of a real symmetric triple-diagonal matrix. It is only
necessary to apply the $\tfrac{1}{2}(n - 1)(n - 2)$ transformations $R_{\alpha\beta}$ as
outlined above. Suppose then that we consider a triple-diagonal matrix


\[
C_n = \begin{pmatrix}
b_1 & c_1 \\
c_1 & b_2 & c_2 \\
 & c_2 & b_3 & c_3 \\
 & & \ddots & \ddots & \ddots \\
 & & & c_{n-2} & b_{n-1} & c_{n-1} \\
 & & & & c_{n-1} & b_n
\end{pmatrix}.
\]

Expanding $f_n = \det(xI - C_n)$ by minors, we see that the sequence $f_0$,
$f_1, \ldots, f_n$ satisfies $f_0 = 1$, $f_1 = x - b_1$,

\[ f_r = (x - b_r)f_{r-1} - c_{r-1}^2 f_{r-2}, \qquad r \ge 2. \tag{6.35} \]

It is well known [33] that the sequence

\[ f_n,\; f_{n-1},\; \ldots,\; f_1,\; f_0 \tag{6.36} \]

is a Sturm sequence, provided that no $c_r$ vanishes. That is, if $V(x)$
denotes the number of variations in sign in the sequence (6.36), zero
terms being omitted, then the number of zeros of $f_n$ in the interval $[a, b]$
is just $V(a) - V(b)$, provided that $a$ and $b$ are not zeros of $f_n$. Thus, if
no $c_r$ vanishes, it follows that the characteristic roots of $C_n$ are all real and
simple, since $f_n$ is the characteristic polynomial of $C_n$. (If some $c_r$ were
to vanish, $C_n$ would become a direct sum, and each summand could be
treated separately.) This can be developed into an effective method for
computing the eigenvalues of $C_n$. We illustrate this by examples.
Example 1. Take the triple-diagonal matrix of order 3 with $b_1 = 1$, $b_2 = 2$,
$b_3 = 0$ and $c_1^2 = c_2^2 = 1$.

Then $f_0 = 1$, $f_1 = x - 1$, $f_2 = x^2 - 3x + 1$, $f_3 = x^3 - 3x^2 + 1$. We make
a small table of signs:

      x     f_3   f_2   f_1   f_0    V(x)

     -1      -     +     -     +      3
      0      +     +     -     +      2
      1      -     -     0     +      1
      2      -     -     +     +      1
      3      +     +     +     +      0

This shows that $f_3$ has one root in $(-1, 0)$, one root in $(0, 1)$, and one root in
$(2, 3)$. If we are interested in, say, the root between 0 and 1, we can proceed
by successive bisection.

It should be noted that in computing f,,(x) in general, the most economical 
way is by the recurrence (6.35). There is therefore no “‘waste”’ in this method, 
and it is highly recommended. 
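The sign count $V(x)$ and the bisection step of Example 1 can be sketched as follows. The function names are hypothetical, the lists `b` and `c` hold $b_1, \ldots, b_n$ and $c_1, \ldots, c_{n-1}$, and the bisection assumes that exactly one root lies in the starting interval.

```python
def sturm_values(b, c, x):
    """Return [f_0, f_1, ..., f_n] at x, computed by the recurrence (6.35)."""
    f = [1.0, x - b[0]]
    for r in range(1, len(b)):
        f.append((x - b[r]) * f[-1] - c[r - 1] ** 2 * f[-2])
    return f


def variations(b, c, x):
    """V(x): number of sign changes in f_n, ..., f_0, zero terms omitted."""
    signs = [v > 0 for v in reversed(sturm_values(b, c, x)) if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def bisect_root(b, c, lo, hi, tol=1e-10):
    """Locate the single root of f_n in (lo, hi), assuming V(lo) - V(hi) == 1."""
    v_lo = variations(b, c, lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if v_lo - variations(b, c, mid) >= 1:
            hi = mid        # the root lies in (lo, mid]
        else:
            lo = mid        # no root in (lo, mid]; V(mid) equals V(lo)
    return 0.5 * (lo + hi)


b, c = [1, 2, 0], [1, 1]     # the matrix of Example 1
print([variations(b, c, x) for x in (-1, 0, 1, 2, 3)])
print(bisect_root(b, c, 0.0, 1.0))
```

Each evaluation of $V$ costs one pass through the recurrence, which is the economy the text refers to.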

Example 2. We consider the application of this method to the matrix

     5     7     6     5
     7    10     8     7
     6     8    10     9
     5     7     9    10

Working to 6 decimals, we find the equivalent triple-diagonal matrix to be

      5           10.488089     0            0
     10.488089    25.472729     3.521903     0
      0            3.521898     3.680571    -.185813
      0            0            -.185813     .846701

The lack of symmetry in this matrix and the discrepancy between the traces
give some indication of the errors incurred at this stage: a few units in the
last place.

It is now possible to apply the Sturm process to locate the roots. They are,
approximately,

     .01015,  .8431,  3.858,  30.29.

We note that the reduction to triple-diagonal form involves about $\tfrac{4}{3}n^3$
multiplications. Thereafter each evaluation of the Sturm sequence involves
$n$ multiplications.


6.9 Jacobi’s Method for Real Symmetric Matrices 


Once again the elementary orthogonal matrices R,, are employed, 
this time in such fashion that A is brought nearer to diagonal form at 
each step. 

We proved in (6.19) that the Frobenius norm is invariant with respect 
to unitary transformations, and since orthogonal transformations are 
just real unitary transformations, we have 


F(R'AR) = F(A), R orthogonal. (6.37) 


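Suppose now that in the matrix $A$ the largest off-diagonal element in
modulus is $a \ne 0$. We lose no generality in assuming $a = a_{12}$. We
choose $c$, $s$ so that $c^2 + s^2 = 1$ and

\[
\begin{pmatrix} c & s \\ -s & c \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix}
\begin{pmatrix} c & -s \\ s & c \end{pmatrix}
=
\begin{pmatrix} b_{11} & 0 \\ 0 & b_{22} \end{pmatrix}.
\]

We put

\[ R = \begin{pmatrix} c & -s \\ s & c \end{pmatrix} \dotplus I^{(n-2)}. \]

Then (6.37) implies that

\[ a_{11}^2 + 2a_{12}^2 + a_{22}^2 = b_{11}^2 + b_{22}^2. \tag{6.38} \]

Let $J(A)$ denote the sum of the squares of the off-diagonal elements of $A$.
Then, using (6.37) once again, we find that

\[ J(R'AR) + b_{11}^2 + b_{22}^2 + a_{33}^2 + \cdots + a_{nn}^2 = J(A) + a_{11}^2 + a_{22}^2 + a_{33}^2 + \cdots + a_{nn}^2, \]

that is,

\[ J(R'AR) + b_{11}^2 + b_{22}^2 = J(A) + a_{11}^2 + a_{22}^2. \]

This, with (6.38), implies that

\[ J(R'AR) = J(A) - 2a_{12}^2. \]

Since $J(A) \le (n^2 - n)a_{12}^2$, this implies that

\[ J(R'AR) \le \left(1 - \frac{2}{n^2 - n}\right) J(A). \tag{6.39} \]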
Suppose that after $k$ such transformations the matrix resulting is
denoted by $A_k$. Then (6.39) implies that

\[ J(A_k) \le \left(1 - \frac{2}{n^2 - n}\right)^k J(A). \]

From this result we conclude that by choosing $k$ sufficiently large, $A_k$
can be made to differ from a diagonal matrix by as little as desired.

For example, suppose $A$ normalized so that $J(A) = 1$. Then, to
make $J(A_k) < \epsilon$, it is sufficient to choose

\[ k \ge \frac{\log (1/\epsilon)}{\log \left[ (n^2 - n)/(n^2 - n - 2) \right]}. \]

Thus for a matrix of order 10 and $\epsilon = 10^{-6}$ we find that some 600
rotations may be required. The process in any case is an "$n^2$" process. Each
rotation involves about $4n$ multiplications, so that we require $O(n^3)$
multiplications altogether.


It is possible to determine the eigenvectors also, if the rotation matrices
are saved. For if

\[ R'AR = D, \qquad D \text{ diagonal}, \]

then the columns of $R$ are the eigenvectors of $A$.
Example. Consider the diagonalization of

     2.879   -.841   -.148    .906
     -.841   3.369   -.111    .380
     -.148   -.111   1.216   -.740
      .906    .380   -.740   3.536

The first stage is the annihilation of the (1,2) and (2,1) elements. Working to 3
places, an appropriate rotation produces

     2.250    .001   -.186    .636
      .001   4.000    .002   -.002
     -.186    .002   1.216   -.740
      .636   -.002   -.740   3.536

A further rotation is applied to annihilate the (3,4) and (4,3) elements and then
another rotation to annihilate the new (1,4) and (4,1) elements; thus the matrix
becomes

     4.005   -.002    .000   -.001
     -.002   4.000    .001   -.002
      .000    .001   1.004    .000
     -.001   -.002    .000   1.999

For further details see Taussky and Todd [8]. It should be noted that the
diagonalization has been accomplished more rapidly than might have been
expected. The size of the off-diagonal elements and the change in the trace
give some indication of the accuracy of the process.
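The rotations of this section can be sketched in code. Two assumptions are made loudly here. First, the rotation angle is chosen from $\tan 2\theta = 2a_{pq}/(a_{pp} - a_{qq})$, which is one standard way of satisfying the annihilation condition and may order the two new diagonal elements differently from the worked example. Second, the (1,4) and (4,1) entries are taken as .510 rather than the .906 shown in the example, because only that value reproduces the later matrices of the example (for instance the entry .636 after the first rotation); this is an inference, not a confirmed correction.

```python
import math


def off_diagonal_sum(a):
    """J(A): sum of the squares of the off-diagonal elements."""
    n = len(a)
    return sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)


def rotate(a, p, q):
    """Return R'AR, with R a plane rotation chosen to annihilate a[p][q]."""
    n = len(a)
    theta = 0.5 * math.atan2(2.0 * a[p][q], a[p][p] - a[q][q])
    c, s = math.cos(theta), math.sin(theta)
    m = [row[:] for row in a]
    for i in range(n):                       # columns p, q of A R
        aip, aiq = m[i][p], m[i][q]
        m[i][p] = c * aip + s * aiq
        m[i][q] = -s * aip + c * aiq
    for j in range(n):                       # rows p, q of R'(A R)
        apj, aqj = m[p][j], m[q][j]
        m[p][j] = c * apj + s * aqj
        m[q][j] = -s * apj + c * aqj
    return m


def jacobi(a, tol=1e-20, max_rotations=1000):
    """Rotate away the largest off-diagonal element until J(A) <= tol."""
    n = len(a)
    m = [row[:] for row in a]
    count = 0
    while off_diagonal_sum(m) > tol and count < max_rotations:
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(m[ij[0]][ij[1]]))
        m = rotate(m, p, q)
        count += 1
    return [m[i][i] for i in range(n)], count


A = [[2.879, -0.841, -0.148, 0.510],     # (1,4) entry assumed to be .510
     [-0.841, 3.369, -0.111, 0.380],
     [-0.148, -0.111, 1.216, -0.740],
     [0.510, 0.380, -0.740, 3.536]]
eigenvalues, rotations = jacobi(A)
print(sorted(eigenvalues), rotations)
```

In practice far fewer rotations are needed than the bound derived from (6.39) suggests, as the text remarks.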


6.10 The Power Method 


The next method we consider will produce eigenvectors and eigen- 
values simultaneously. In addition, it is sometimes applicable when the 
previous methods are not. There are certain disadvantages, however. 

We choose an arbitrary vector $v^{(0)}$ normalized in some way, for example,
to have one of its coordinates unity or to have the sum of the
squares of the coordinates unity. We apply the matrix $A$ repeatedly to
the vector $v^{(0)}$, expressing each product vector as a scalar multiple of a
vector in the chosen normalization. Specifically, if $v^{(i)}$ is normalized,
then we define $\mu^{(i+1)}$ by the equation

\[ Av^{(i)} = \mu^{(i+1)}v^{(i+1)}, \]

where $v^{(i+1)}$ is normalized. It can be shown that, if $A$ has a single
dominant root, then these multipliers tend to this value, and the
(normalized) vectors tend to the corresponding (normalized) characteristic
vector.

The justification of this process is simple. It is known that, commonly,
a matrix $A$ has $n$ different characteristic roots $\lambda_i$ and $n$ distinct
characteristic vectors $c_i$ which are linearly independent and which therefore span
the whole space. An arbitrary vector $v^{(0)}$ can be expressed in the form

\[ v^{(0)} = \sum_i a_i c_i. \]

Since $Ac_i = \lambda_i c_i$ for $i = 1, 2, \ldots, n$, we have

\[ A^r v^{(0)} = \sum_i a_i \lambda_i^r c_i, \]

and from this, if $|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|$, we have, for sufficiently
large $r$ (depending on the separation of the $\lambda$'s),

\[ A^r v^{(0)} \approx a_1 \lambda_1^r c_1. \]

From this the statements made above follow: that $A^r v^{(0)}$, and hence $v^{(r)}$, is
approximately a multiple of $c_1$ and that the ratio of corresponding components of
$A^r v^{(0)}$ and $A^{r-1} v^{(0)}$ is approximately $\lambda_1$.
For a thorough discussion of this method see Wilkinson [23]. 
Example. Let us consider the case

\[
A = \begin{pmatrix}
.2 & .9 & 1.32 \\
-11.2 & 22.28 & -10.72 \\
-5.8 & 9.45 & -1.94
\end{pmatrix}
\]

and choose $v^{(0)} = (1, 0, 0)$ and normalize by making the first coordinate unity
for simplicity. We obtain the following results:

    Av^(0) = (.2, -11.2, -5.8) = mu^(1) v^(1)                where mu^(1) = .2,
             v^(1) = (1, -56, -29)

    Av^(1) = (-88.48, -948, -478.74) = mu^(2) v^(2)          where mu^(2) = -88.48,
             v^(2) = (1, 10.7143, 5.4107)

    Av^(2) = (16.9850, 169.5119, 84.9534) = mu^(3) v^(3)     where mu^(3) = 16.9850,
             v^(3) = (1, 9.9801, 5.0017)

    Av^(3) = (15.7843, 157.5384, 78.8086) = mu^(4) v^(4)     where mu^(4) = 15.7843,
             v^(4) = (1, 9.9807, 4.9928)

    Av^(4) = (15.7731, 157.6472, 78.8316) = mu^(5) v^(5)     where mu^(5) = 15.7731,
             v^(5) = (1, 9.9947, 4.9979)

    Av^(5) = (15.7925, 157.9044, 78.9540) = mu^(6) v^(6)     where mu^(6) = 15.7925,
             v^(6) = (1, 9.9987, 4.9995).

It can be verified that the exact results are $\lambda_1 = 15.8$ and $v_1 = (1, 10, 5)$.
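The iteration of this example can be sketched as follows. The sketch assumes the normalization used in the example (first coordinate unity, which requires that coordinate never to vanish) and reads the (3,3) entry of the matrix as $-1.94$, the value that reproduces the printed products; exact fractions are used, so later figures may differ slightly from the four-decimal working in the text.

```python
from fractions import Fraction


def power_method(a, v0, steps):
    """Return the multipliers mu^(1), ..., mu^(steps) and the last vector.

    Each product A v is rescaled so that its first coordinate is unity.
    """
    v = [Fraction(x) for x in v0]
    multipliers = []
    for _ in range(steps):
        w = [sum(Fraction(aij) * vj for aij, vj in zip(row, v)) for row in a]
        mu = w[0]                    # normalizing factor: first coordinate
        multipliers.append(mu)
        v = [x / mu for x in w]
    return multipliers, v


A = [["0.2", "0.9", "1.32"],
     ["-11.2", "22.28", "-10.72"],
     ["-5.8", "9.45", "-1.94"]]      # (3,3) entry assumed to be -1.94

mus, v = power_method(A, [1, 0, 0], 6)
print([round(float(m), 4) for m in mus])
print([round(float(x), 4) for x in v])
```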


6.11 The Rayleigh Quotient Method 


The Rayleigh quotient method has some similarities with the power 
method, and the ideas in the two are often combined. 

The basis of the method is the remark that near a maximum $x_0$ of $f(x)$,
if $f'(x)$ exists, then

\[ f(x) - f(x_0) = O\big((x - x_0)^2\big). \]

This is applied where $x$ is an $n$-dimensional vector and $f(x) = R(x)$, the
Rayleigh quotient for the matrix $A$ in question, which initially is assumed
to be real and symmetric and to have a dominant characteristic value $\lambda_1$.
The Rayleigh quotient is defined by

\[ R(x) = \frac{x'Ax}{x'x}. \]

If $x$ is a characteristic vector corresponding to the characteristic value
$\lambda$, then clearly $R(x) = \lambda$.

The method proceeds as follows. Let $\xi_0$ be an initial approximation
to a characteristic vector. If

\[ A\xi_0 - R(\xi_0)\xi_0 \]

is small enough to be considered zero, $\xi_0$ is taken as the characteristic
vector. If not, we determine an "approximate solution" $\xi_1$ to the nearly singular
system

\[ A\xi = R(\xi_0)\xi, \]

and we repeat the process with $\xi_1$ replacing $\xi_0$.
A meaning for the phrase "approximate solution" which enables convergence to be
established has been given by Ostrowski [24].


Example. Let

\[
A = \begin{pmatrix}
3.500 & .750 & 1.299 \\
.750 & 1.625 & 1.083 \\
1.299 & 1.083 & 2.875
\end{pmatrix}.
\]

We guess $\xi_0 = (1, 1, 1)$ and estimate the dominant characteristic value
$R(\xi_0) = 4.755$. A "solution" to $A\xi = 4.755\xi$ is $\xi_1 = (1, .5, .9)$. We find
$R(\xi_1) = 4.999$, and a "solution" to $A\xi = 4.999\xi$ is $\xi_2 = (1, .5, .867)$. Since
this gives $R(\xi_2) = 4.999 = R(\xi_1)$, the process terminates here.
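The Rayleigh quotients quoted in the example are quick to check. The sketch below only evaluates $R(x) = x'Ax / x'x$ and the residual $Ax - R(x)x$ for the three vectors of the example; it does not attempt the "approximate solution" step, whose precise meaning the text leaves to Ostrowski. The function names are illustrative.

```python
def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def rayleigh_quotient(a, x):
    """R(x) = x'Ax / x'x for a real symmetric matrix a."""
    ax = matvec(a, x)
    return sum(xi * yi for xi, yi in zip(x, ax)) / sum(xi * xi for xi in x)


def residual(a, x):
    """The vector A x - R(x) x, which vanishes for a characteristic vector."""
    r = rayleigh_quotient(a, x)
    return [yi - r * xi for xi, yi in zip(x, matvec(a, x))]


A = [[3.500, 0.750, 1.299],
     [0.750, 1.625, 1.083],
     [1.299, 1.083, 2.875]]

for xi in ([1, 1, 1], [1, 0.5, 0.9], [1, 0.5, 0.867]):
    print(round(rayleigh_quotient(A, xi), 3),
          [round(v, 3) for v in residual(A, xi)])
```

The residual shrinks from one vector to the next while the quotient settles after the first correction, which illustrates the second-order behaviour of $R(x)$ near its maximum.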


6.12 Deflation 


The following remark is also worth noticing. Let $A_1$ be the first row
of $A$. Let $\lambda$ be an eigenvalue of $A$ and let $x$ be a corresponding
eigenvector, normalized to have first component unity. We define

\[ \tilde{A} = A - xA_1. \]

Then $A$ and $\tilde{A}$ have the same eigenvalues except for $\lambda$, which has become
0, and there is a simple relation between the eigenvectors of $A$ and of $\tilde{A}$.
Thus, suppose that $\lambda_1$ is an eigenvalue of $A$ and that $x_1$ is a corresponding
eigenvector, normalized to have first component unity. Then, if $\lambda \ne \lambda_1$,

\[
\begin{aligned}
\tilde{A}(x - x_1) &= (A - xA_1)(x - x_1) \\
 &= A(x - x_1) - xA_1(x - x_1) \\
 &= \lambda x - \lambda_1 x_1 - x(\lambda - \lambda_1) \\
 &= \lambda_1(x - x_1),
\end{aligned}
\]

so that $\lambda_1$ is also an eigenvalue of $\tilde{A}$, and $x - x_1$ is a corresponding
eigenvector; whereas if $\lambda_1 = \lambda$,

\[ \tilde{A}x = (A - xA_1)x = Ax - xA_1x = \lambda x - x\lambda = 0. \]

We note that the first row of $\tilde{A}$ is 0 and that each eigenvector $x - x_1$ of $\tilde{A}$
has first component zero. It is only necessary, therefore, to work with
the principal minor matrix of $\tilde{A}$ obtained by striking out its first row and
column. This process may be continued and is known as deflation, since
the order of the matrix is reduced by unity as each pair $\lambda$, $x$ is computed.
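As an illustration, the deflation step can be applied to the matrix of the power-method example in Section 6.10, using the exact pair $\lambda = 15.8$, $x = (1, 10, 5)$ quoted there. The sketch assumes the (3,3) entry of that matrix is $-1.94$ (the value consistent with the printed iterates), and the function name `deflate` is hypothetical. The remaining two eigenvalues are then read off from the $2 \times 2$ principal minor.

```python
import math
from fractions import Fraction


def deflate(a, x):
    """Return A - x A_1, where A_1 is the first row of A and x[0] == 1."""
    first_row = a[0]
    n = len(a)
    return [[a[i][j] - x[i] * first_row[j] for j in range(n)] for i in range(n)]


A = [[Fraction(s) for s in row] for row in (
    ["0.2", "0.9", "1.32"],
    ["-11.2", "22.28", "-10.72"],
    ["-5.8", "9.45", "-1.94"],        # (3,3) entry assumed to be -1.94
)]
x = [Fraction(1), Fraction(10), Fraction(5)]

deflated = deflate(A, x)
minor = [row[1:] for row in deflated[1:]]       # strike out first row and column

trace = minor[0][0] + minor[1][1]
det = minor[0][0] * minor[1][1] - minor[0][1] * minor[1][0]
disc = math.sqrt(float(trace * trace - 4 * det))
remaining = [(float(trace) + disc) / 2, (float(trace) - disc) / 2]
print(deflated[0], remaining)
```

With exact data the remaining eigenvalues come out cleanly; with a computed $\lambda$ and $x$ the errors of the first stage would be carried into the minor, which is the "successive contamination" mentioned in Section 6.6.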
7.1 Introduction


The following problem is considered here: given an equation

\[ f(z) = 0, \tag{7.1} \]

where $f(z)$ is a nonlinear function of $z$ in the complex plane, find solutions
$z_i$, which are sometimes also called roots of $f(z)$, satisfying this relation.
If $f(z)$ is a transcendental function, there may exist an infinity of solutions,
whereas in the case of algebraic functions the number is always finite.

In most cases it is not possible to give an expression in closed form for
the solutions of this problem. Even in those instances where such
expressions exist, they may not be very useful for numerical purposes.
Therefore, they are not considered here. As is very frequently the case
in numerical analysis, iterative methods which lead to approximations
for the solutions will be the best tool for solving the stated problem.
There already exist a large number of them, some better than others, but
up to now there is no one method which is the best in all cases. The
advantages and disadvantages of each method depend to a large extent
on the particular nature of the problem at hand. In general, one can
distinguish three categories of problems: 

1. Determination of a single root whose approximate situation is 
known. 

2. Determination of approximate values for all the roots. 

3. Determination of the number of roots in a given region of the 
complex plane. 
The methods which are most efficient in solving these problems are not
the same for all three, although each of them can be used in all the
mentioned cases. It is, for instance, quite obvious that a procedure
which finds all the roots can be applied also in the first and third case;
there, however, it may be very wasteful with respect to the computational
work which has to be done.

For this reason, the three categories of problems are treated separately.
In each case, some methods of solution are given which have proved to
be effective and efficient in practical work. Because of the limited scope
of this discussion, no attempt can be made to be complete in the
exposition of available methods. For additional information on such methods,
see the list of recent papers at the end of the chapter; for the older
literature, the textbooks, in particular [1,2], can be consulted. An
evaluation of the relative merits of the different methods is very
difficult, since it depends much on the particular case at hand. For
this reason, one should not infer that any method which is not
considered here has no practical value. The selection here is based mainly
on actual experience in numerical work with automatic digital
computers. It is therefore mainly intended for the users of such
machines.

The following discussion is restricted to the consideration of
polynomials whenever this is convenient. This simplification is justified,
since in digital computers any transcendental function has to be
represented by rational approximations, so that the problem of finding
zeros of such functions reduces, from a practical point of view, to finding
zeros of polynomials.

When numerical methods, to be used with digital computers, for finding
zeros of polynomials are considered, one fundamental question arises
with respect to the meaningfulness of the results. Since in most
practical cases the coefficients of the given polynomials can be introduced
into the machine with only a limited accuracy, it is important to know
whether, irrespective of the magnitude of the coefficients, a certain
number of significant digits can be guaranteed for the roots. This
problem has been considered by Ostrowski [32], who proved the following
theorem*.


* Note that this result is frequently not very practical. (The reader should
determine the $\tau$ necessary to ensure a relative error of, say, 1 per cent, in the case $n = 10$.)
On the other hand, the result is near the truth, e.g., if $f(z) = (z - 1)^n$, $g(z) = f(z) - \epsilon$.
On the whole, the stability of the roots as functions of the coefficients should
be studied, usually experimentally, in any particular case. The paper of Wilkinson
[33] includes discussions of many particular cases and points out that violent changes
in the roots can be observed even in cases when the roots are well separated. The
paper of Olver [38], although mainly addressed to desk computers, contains valuable
worked examples.
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Theorem 7.1. Lei 
f(z) = 2 G;2’, g(z) = > b,z’ 
7=0 j=0 
be two polynomials whose corresponding coefficients differ only that much from each 
other that there exists at # 0, with 4nzr¥" < 1, for which 
|; a a,| <T la;|, J — 0, l, 2; coy M. (7.2) 


([t ts assumed that a), a, #0.) Thenif we designate by x;(t = 1, 2,..., 7) 
the roots of f(z) and by y, the roots of g(z), the_y; can be ordered such that 


Mie 


x; 


<8nr", 7 =1,2,...,2. (7.3) 


This theorem gives assurance that, if a reasonable number of significant
digits are used for the coefficients, one can expect also some meaningful-
ness in the results. Obviously, the theorem does not take into account
any errors introduced by the method of solution, which may still affect 
the results considerably. 

After these introductory remarks, some numerical methods for each of 
the three stated problems are discussed in the following sections. 


7.2 Determination of a Single Root Whose Approximate 
Situation Is Known 


It has already been noted that for numerical purposes iterative 
methods are in general the only effective way of finding roots of non-
linear equations. It will be convenient here to reformulate the problem
in the following form. 

Find solutions of 


$$z = g(z), \qquad \text{where } g(z) = z - h(z)\,f(z). \tag{7.4}$$


h(z) is supposed to be an analytic function different from zero in the 
neighborhood of the considered root x. 

For this problem, an iterative method of solution can easily be defined,
starting from an initial guess $z_0$ and finding new approximations
$z_i$ for the root by the formula

$$z_i = g(z_{i-1}), \qquad i = 1, 2, \ldots. \tag{7.5}$$

It remains to be shown that this process actually converges to a solution
of the problem stated at the beginning. It is easy to see that, if the
sequence of the $z_i$ converges, it must converge to a root of $f(z)$, since
both the $z_i$ and $g(z_{i-1})$ have the same limit.

In cases where $g(z)$ is an analytic function (e.g., a polynomial) in a
neighborhood $N(x)$ of the considered root $x$, the convergence of the given
iteration can be easily shown if $|g'(z)| < 1$ for all $z \in N(x)$. For then


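$$g(z) - x = (z - x)\,g'(x) + \tfrac{1}{2}(z - x)^2 g''(x) + \cdots + \frac{1}{n!}(z - x)^n g^{(n)}(\xi), \tag{7.6}$$

where $\xi \in N(x)$ if $z \in N(x)$ or, if $g'(x) = g''(x) = \cdots = g^{(n-1)}(x) = 0$,

$$g(z) - x = \frac{1}{n!}(z - x)^n g^{(n)}(\xi). \tag{7.7}$$

Therefore $|g(z) - x| \le k\,|z - x|^n$ for a suitably chosen constant $k$ and $z$
in $N(x)$. From this it follows that $|z_1 - x| = |g(z_0) - x| \le k\,|z_0 - x|^n$,

$$|z_m - x| \le k\,|z_{m-1} - x|^n \le \cdots \le k^{(n^m - 1)/(n-1)}\,|z_0 - x|^{n^m} = k^{-1/(n-1)} \left(k^{1/(n-1)}\,|z_0 - x|\right)^{n^m}. \tag{7.8}$$

For $n \ne 1$, one sees immediately that the process converges if

$$k^{1/(n-1)}\,|z_0 - x| < 1; \tag{7.9}$$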


this means that it converges always if the initial guess is sufficiently close 
to the desired root. 
For $n = 1$,

$$|z_m - x| \le k^m\,|z_0 - x|; \tag{7.10}$$

therefore, the method converges only if $k < 1$, which is the case
if, as has been assumed, $|g'(z)| < 1$.

From this consideration it follows also that, for a given initial guess $z_0$,
the iteration converges the better the more derivatives of $g(z)$ are zero.

Definition: The iteration $z_{i+1} = g(z_i)$ is of order $m$ in a neighborhood
of a root $x$ if $g'(x) = g''(x) = \cdots = g^{(m-1)}(x) = 0$ and $g^{(m)}(x) \ne 0$.

Many of the well-known classical methods are special cases of these 
iteration methods. A few of them, which have proved their value in 
numerical work, are discussed here. 
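
The definition of order can be checked numerically. The following is only a minimal sketch under stated assumptions: the function names, the test equation $f(z) = z^2 - 2$, and the choice $h(z) = 1/4$ are illustrative and do not come from the text. For a first-order iteration the ratios of successive errors should settle near $|g'(x)|$.

```python
import math

def fixed_point(g, z0, tol=1e-12, max_iter=200):
    """Iterate z_i = g(z_{i-1}); return the last iterate and the full history."""
    history = [z0]
    for _ in range(max_iter):
        history.append(g(history[-1]))
        if abs(history[-1] - history[-2]) <= tol:
            break
    return history[-1], history

# Illustrative choice (an assumption, not from the text):
# f(z) = z^2 - 2 with h(z) = 1/4, so g(z) = z - (z^2 - 2)/4
# and g'(sqrt(2)) = 1 - sqrt(2)/2, roughly 0.29 (a first-order iteration).
def g_example(z):
    return z - (z * z - 2.0) / 4.0

root, history = fixed_point(g_example, 1.0)
errors = [abs(z - math.sqrt(2.0)) for z in history]
# Ratios of successive errors; for order 1 they approach |g'(x)|.
ratios = [errors[i + 1] / errors[i] for i in range(5, 10)]
```

Because $|g'(x)| < 1$ here, the iteration converges, but each step gains only a fixed fraction of a digit; the methods below make more derivatives of $g$ vanish at the root.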


(a) Rule of False Position 


One of the simplest and still quite useful methods for determining real 
roots of a nonlinear function is the rule of false position. The method 
uses, geometrically speaking, the chord between two points $[z_i, f(z_i)]$ and
$[\bar{z}, f(\bar{z})]$ in order to find a better approximation $z_{i+1}$ to the root $x$ of $f(z)$.
If $\bar{z}$ is kept fixed during the whole iteration, this corresponds to choosing

$$g(z) = \frac{\bar{z}\,f(z) - z\,f(\bar{z})}{f(z) - f(\bar{z})}. \tag{7.11}$$

The derivative of $g(z)$ at the root $x$ is

$$g'(x) = \frac{f(\bar{z}) + (x - \bar{z})\,f'(x)}{f(\bar{z})} = \frac{(\bar{z} - x)^2 f''(\xi)}{2 f(\bar{z})}, \tag{7.12}$$

where $\xi$ is a point in the interval between $\bar{z}$ and $x$. This expression is in
general not zero, but for $\bar{z}$ sufficiently close to $x$ it will be smaller than
one in absolute value; that is, the method is convergent. The rule of
false position is therefore a first-order iteration procedure.

Instead of keeping Z fixed, one can move it during the computation— 
for example, by using always the latest two points given by the iteration. 
Ostrowski has shown that the speed of convergence is then somewhat 
better than in the considered case. 
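
The two variants can be compared with a short sketch. Everything named here (`false_position_fixed`, `secant`, the test function $z^3 - 2z - 5$ and the starting values) is an illustrative assumption rather than part of the text; the second routine uses the latest two points, which is the variant usually called the secant method.

```python
def false_position_fixed(f, z_bar, z0, tol=1e-12, max_iter=500):
    """Chord iteration through the fixed point (z_bar, f(z_bar))."""
    f_bar = f(z_bar)
    z = z0
    for count in range(1, max_iter + 1):
        fz = f(z)
        if fz == f_bar:
            return z, count  # horizontal chord; cannot continue
        z_new = (z_bar * fz - z * f_bar) / (fz - f_bar)
        if abs(z_new - z) <= tol:
            return z_new, count
        z = z_new
    return z, max_iter

def secant(f, z0, z1, tol=1e-12, max_iter=500):
    """Chord iteration that always uses the latest two points."""
    f0, f1 = f(z0), f(z1)
    for count in range(1, max_iter + 1):
        if f1 == f0:
            return z1, count
        z2 = (z0 * f1 - z1 * f0) / (f1 - f0)
        if abs(z2 - z1) <= tol:
            return z2, count
        z0, f0, z1, f1 = z1, f1, z2, f(z2)
    return z1, max_iter

def f_example(z):
    # Hypothetical test function with a real root near 2.0946.
    return z ** 3 - 2.0 * z - 5.0

root_fixed, steps_fixed = false_position_fixed(f_example, 3.0, 2.0)
root_moving, steps_moving = secant(f_example, 2.0, 3.0)
```

With these starting values both routines reach the same root, and the moving-point variant needs noticeably fewer steps, in line with the remark attributed to Ostrowski.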


(b) Newton’s Method 


Newton’s method works for real and complex roots. In the case of 
real roots, it corresponds, geometrically speaking, to using the tangent to 
the curve $f(z)$ in the last-found point $[z_i, f(z_i)]$ to find a new
approximation $z_{i+1}$. Therefore,

$$g(z) = z - \frac{f(z)}{f'(z)}. \tag{7.13}$$

In order to determine the convergence of this method, the first two
derivatives at the root $x$ are determined: $g'(x) = 0$,

$$g''(x) = \frac{f''(x)}{f'(x)}. \tag{7.14}$$

The second derivative is therefore in general different from zero, so
that the Newton method is always convergent (if the first guess is
sufficiently close to a root) and is a second-order iteration method.
However, the choice of the initial approximation is very important, since
by a bad first guess the method may diverge [17]. It has also to be
noted that for polynomials with real coefficients, a first approximation
on the real axis can lead only to real roots, since the iteration formula
gives only real values.
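
The last remark is easy to observe in a small experiment. This is a sketch under assumptions: the routine name `newton`, the stopping rule, and the test polynomial $z^3 - 1$ are chosen for illustration only.

```python
def newton(f, df, z0, tol=1e-12, max_iter=100):
    """Newton iteration z <- z - f(z)/f'(z); works for real or complex z0."""
    z = z0
    for count in range(1, max_iter + 1):
        dfz = df(z)
        if dfz == 0:
            raise ZeroDivisionError("derivative vanished; try another start")
        z_new = z - f(z) / dfz
        if abs(z_new - z) <= tol * max(1.0, abs(z_new)):
            return z_new, count
        z = z_new
    raise RuntimeError("no convergence within max_iter steps")

def f_cubic(z):
    return z ** 3 - 1.0   # real coefficients, one real and two complex roots

def df_cubic(z):
    return 3.0 * z * z

# A real start stays on the real axis and can only reach the real root.
root_real, steps_real = newton(f_cubic, df_cubic, 2.0)
# A complex start is needed to reach one of the complex roots.
root_complex, steps_complex = newton(f_cubic, df_cubic, complex(-1.0, 1.0))
```

The real start returns an ordinary floating-point number near 1, whereas the complex start settles on the root near $-\tfrac12 + \tfrac{\sqrt3}{2} i$.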


(c) Laguerre’s Method 


The Laguerre method is an iterative method for determining the
zeros of polynomials. If $f(z) = P_n(z)$ is a polynomial of $n$th degree,
the Laguerre method is obtained by setting

$$g(z) = z - \frac{n f(z)}{f'(z) \pm \sqrt{(n-1)\left[(n-1) f'(z)^2 - n f(z) f''(z)\right]}}. \tag{7.15}$$

In its geometrical interpretation, this method amounts to approximating
the polynomial by parabolas between two zeros. Accordingly, there
are two values of $g(z)$, depending on which root is to be approximated.
For practical purposes, that sign is used which makes the denominator
larger; that is, the solution closest to the first guess is approximated.

The first few derivatives at a root $x$ are

$$g'(x) = 0, \qquad g''(x) = 0, \qquad g'''(x) = \ldots, \tag{7.16}$$

so that in general the third derivative is the first different from zero; that
is, the method is a third-order iteration procedure.

In comparison with the other methods so far discussed, the Laguerre 
method has the advantage that it converges faster and that it works also 
for the complex roots of polynomials with real coefficients, even if one 
starts out with a real guess. (The expression under the root in the 
denominator may become negative.) It has, however, the drawback 
that higher-order derivatives have to be computed. This can be done 
most conveniently by using synthetic division (Horner scheme); that is, if

$$f(z) = \sum_{j=0}^{n} a_{n-j} z^j$$

should be evaluated with its derivatives for $z = x$, then this can be done
recursively by computing

$$a_{0,0} = a_0, \tag{7.17}$$
$$a_{i,0} = a_{i-1,0}\,x + a_i, \qquad i = 1, 2, \ldots, n; \tag{7.18}$$
$$a_{0,j} = a_0, \qquad a_{i,j} = a_{i-1,j}\,x + a_{i,j-1}, \qquad i = 1, \ldots, n - j; \tag{7.19}$$

and then

$$f(x) = a_{n,0}, \quad f'(x) = a_{n-1,1}, \quad f''(x) = 2!\,a_{n-2,2}, \quad \ldots, \quad f^{(j)}(x) = j!\,a_{n-j,j}. \tag{7.20}$$
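
The Horner scheme and the Laguerre step fit together as in the following sketch. It assumes the coefficient list is ordered $a_0, a_1, \ldots, a_n$ with $a_0$ the leading coefficient, as in the text; the function names, tolerances, and the test polynomial are illustrative assumptions.

```python
import cmath

def horner_derivatives(a, x, order=2):
    """Return [f(x), f'(x), ..., f^(order)(x)] by repeated synthetic division."""
    n = len(a) - 1
    row = list(a)
    derivs = []
    factorial = 1
    for j in range(order + 1):
        if j > n:
            derivs.append(0)          # derivatives beyond the degree vanish
            continue
        # One pass of the recursion: row[i] <- row[i-1]*x + row[i], i = 1..n-j
        for i in range(1, n - j + 1):
            row[i] = row[i - 1] * x + row[i]
        if j > 0:
            factorial *= j
        derivs.append(factorial * row[n - j])   # f^(j)(x) = j! * a_{n-j,j}
    return derivs

def laguerre(a, z0, tol=1e-12, max_iter=50):
    """Laguerre iteration; the sign giving the larger denominator is used."""
    n = len(a) - 1
    z = complex(z0)
    for count in range(1, max_iter + 1):
        f, df, d2f = horner_derivatives(a, z, 2)
        if f == 0:
            return z, count
        root = cmath.sqrt((n - 1) * ((n - 1) * df * df - n * f * d2f))
        denom = df + root if abs(df + root) >= abs(df - root) else df - root
        if denom == 0:
            raise ZeroDivisionError("Laguerre denominator vanished")
        z_new = z - n * f / denom
        if abs(z_new - z) <= tol * max(1.0, abs(z_new)):
            return z_new, count
        z = z_new
    raise RuntimeError("no convergence within max_iter steps")

# z^3 - 2z - 5 at x = 2: f = -1, f' = 10, f'' = 12.
values = horner_derivatives([1, 0, -2, -5], 2)
# z^3 - 1 from the real start -1: the square root becomes imaginary,
# so the iteration leaves the real axis and finds a complex root.
zero, steps = laguerre([1, 0, 0, -1], -1.0)
```

The second call illustrates the remark that a real starting value can lead to a complex root once the expression under the square root becomes negative.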


(d) Higher-order Processes by a Combination of Lower-order Processes 


Higher-order processes can also be obtained by combining lower-order 
processes. The formula $z_{i+1} = g(z_i)$ generates a sequence of values $z_i$,
which, as has been shown, converges if $|g'(z)| < 1$. Geometrically, this
sequence corresponds to a sequence of points on the curve defined by
$g(z)$. Any two successive points $(z_{i-1}, z_i)$ and $(z_i, z_{i+1})$ can be used to
obtain a new estimate by intersecting the straight line through them
with the line $y = z$. In formulas,

$$\tilde{z} = \frac{z_{i-1} z_{i+1} - z_i^2}{z_{i+1} - 2z_i + z_{i-1}}. \tag{7.21}$$

This can be interpreted as a new iterative procedure of the same type as
before, but with another function, $G(z)$, in the place of $g(z)$:

$$G(z) = \frac{z\,g[g(z)] - [g(z)]^2}{z - 2g(z) + g[g(z)]}. \tag{7.22}$$

It is easy to show that the iteration

$$z_{i+1} = G(z_i) \tag{7.23}$$

has a higher order of convergence than the one with $g(z)$, if $g'(x) \ne 1$.
In particular, if $g(z)$ is of order 1, then $G(z)$ is at least of order 2; if
$g(z)$ is of order $r$ ($r > 1$), then $G(z)$ is at least of order $2r - 1$.

Proof. In order to simplify the proof, $x$ is assumed to be zero, which
corresponds to a simple translation of the coordinate system. Therefore,
if $g(z)$ is a convergent process of order $r$,

$$g(z) = \frac{g^{(r)}(0)}{r!}\,z^r + \cdots, \tag{7.24}$$

$$g[g(z)] = \left[\frac{g^{(r)}(0)}{r!}\right]^{r+1} z^{r^2} + \cdots. \tag{7.25}$$

For $r > 1$, the term of lowest degree in the numerator is therefore of
degree $2r$ in $z$, in the denominator of degree 1; that is, $G(z)$ expanded in
powers of $z$ has no terms in $z$ of degree lower than $2r - 1$, from which
the second part of the assertion follows.

For $r = 1$,

$$z\,g[g(z)] = [g'(0)]^2 z^2 + \cdots, \tag{7.26}$$

$$[g(z)]^2 = [g'(0)]^2 z^2 + \cdots, \tag{7.27}$$

$$z - 2g(z) + g[g(z)] = z\left\{1 - 2g'(0) + [g'(0)]^2\right\} + \cdots. \tag{7.28}$$

Therefore, in the denominator the coefficient of $z$ is different from zero
[because of the assumption $g'(0) \ne 1$]. So the expansion of $G(z)$ in
powers of $z$ has no term lower than the second degree; that is, $G(z)$
defines an iteration of order 2 at least. Since it has only to be assumed that
$g'(0) \ne 1$, this iterative procedure converges even if the iteration defined
by $g(z)$ does not converge [when $|g'(0)| > 1$].

Obviously, even higher-order iteration procedures can be defined by 
repeated application of this method. In practical applications, itera- 
tion procedures of lower order are preferred, because of the simplicity of 
their application. The advantage of the more rapid convergence of 
higher-order methods comes fully into play only when the approximation 
is sufficiently close to the solution. Therefore, if the entire amount of 
work is considered for finding one zero, an iteration procedure of low- 
order convergence (order 2 or 3) usually is most advantageous. 
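
The construction of $G(z)$ from $g(z)$ in (7.22) can be tried out directly. The sketch below is an illustration under assumptions: the helper name `accelerated`, the algebraically equivalent but numerically gentler form of (7.22), and the deliberately divergent example $g(z) = z^2 + z - 2$ are not taken from the text.

```python
def accelerated(g):
    """Build G from g as in (7.22), written as z - (g(z) - z)^2 / denominator."""
    def G(z):
        g1 = g(z)
        g2 = g(g1)
        denom = z - 2.0 * g1 + g2
        if denom == 0:
            return g2              # iterates have stopped changing
        # Same value as (z*g2 - g1**2)/denom, with less cancellation.
        return z - (g1 - z) ** 2 / denom
    return G

# Hypothetical example: g(z) = z^2 + z - 2 has the fixed point sqrt(2),
# but g'(sqrt(2)) is about 3.8, so z_{i+1} = g(z_i) itself diverges.
def g_divergent(z):
    return z * z + z - 2.0

G = accelerated(g_divergent)
z = 1.5
for _ in range(20):
    z_next = G(z)
    if abs(z_next - z) <= 1e-14:
        z = z_next
        break
    z = z_next
accelerated_root = z
```

Even though the underlying iteration diverges, the combined process converges from this starting value, as the proof predicts for $g'(x) \ne 1$.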


7.3. Determination of Approximate Values for All Roots 


(a) Combined Methods 


The methods already discussed can be easily adapted to the determination
of approximate values for all roots. It is only necessary to
combine them with a procedure which prevents the repeated evaluation
of the same root. In the case of polynomials, such a procedure consists,
for instance, in the elimination of the roots already obtained by synthetic
division. This amounts to carrying through a complete Horner
scheme, where the argument $z$ is equal to the root determined. The
coefficients $a_{n-j,j}$ ($j = 1, 2, \ldots, n$), given by (7.20), define then a
polynomial of $(n-1)$st degree, $P_{n-1}(z) = \sum_{j=1}^{n} a_{n-j,j}\,z^{j-1}$, which has as its
roots those roots of the original polynomial which have not yet been
determined, measured from the computed root as origin (provided that the
roots are simple).

In using synthetic division, care has to be taken to minimize the 
accumulation of round-off errors introduced each time because of the 
limited accuracy of the computed roots. It is advisable to start by
computing the roots smallest in absolute value and to proceed in the 
order of their relative magnitude. Starting with the larger roots (in 
absolute value) may introduce such large errors into the coefficients of 
the polynomials derived from the original one by synthetic division that 
it is impossible to obtain the smaller roots with the desired accuracy. 
The use of synthetic division has the advantage that the work necessary 
to obtain a root decreases as more roots are obtained, since the degrees 
of the polynomials decrease. 
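
A minimal sketch of this combination follows. It uses the ordinary quotient of a single synthetic division (the first Horner row) rather than the complete scheme, and Newton's method as the root finder; both choices, the function names, and the test polynomial with roots 1, 2, 3, 4 are assumptions made for illustration. Starting each search at 0 makes Newton's method pick up the smallest remaining root first for this example, in line with the advice above.

```python
def synthetic_division(a, x):
    """Divide sum(a[j] * z**(n-j)) by (z - x); return (quotient, remainder)."""
    row = [a[0]]
    for coeff in a[1:]:
        row.append(row[-1] * x + coeff)
    return row[:-1], row[-1]

def newton_poly(a, z0, tol=1e-12, max_iter=200):
    """Newton's method for a polynomial; f and f' both come from Horner rows."""
    z = z0
    for _ in range(max_iter):
        quotient, f = synthetic_division(a, z)
        _, df = synthetic_division(quotient, z)   # f'(z) equals quotient(z)
        if df == 0:
            raise ZeroDivisionError("derivative vanished; try another start")
        step = f / df
        z -= step
        if abs(step) <= tol * max(1.0, abs(z)):
            return z
    raise RuntimeError("no convergence within max_iter steps")

def all_roots_by_deflation(a, z0=0.0):
    """Find real simple roots one at a time, deflating after each one."""
    work = list(a)
    roots = []
    while len(work) > 2:
        r = newton_poly(work, z0)
        roots.append(r)
        work, _ = synthetic_division(work, r)   # remove the factor (z - r)
    roots.append(-work[1] / work[0])            # last linear factor
    return roots

# (z - 1)(z - 2)(z - 3)(z - 4) = z^4 - 10z^3 + 35z^2 - 50z + 24
deflated_roots = all_roots_by_deflation([1.0, -10.0, 35.0, -50.0, 24.0])
```

The sketch handles only real, simple roots from a real start; complex roots would need a complex starting value or a method such as Laguerre's.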

Another procedure, which avoids to some extent the mentioned difficulty
of the accumulation of round-off errors, is based on the following
observation: if $P_n(z) = \sum_{j=0}^{n} a_{n-j} z^j$ and $x_1, x_2, \ldots, x_n$ are its roots, then
also

$$P_n(z) = a_0 (z - x_1)(z - x_2) \cdots (z - x_n). \tag{7.29}$$

Therefore

$$\frac{P_n'(z)}{P_n(z)} = \sum_{i=1}^{n} \frac{1}{z - x_i} = s_1(z), \tag{7.30}$$

$$\frac{P_n'(z)^2 - P_n(z) P_n''(z)}{P_n(z)^2} = \sum_{i=1}^{n} \frac{1}{(z - x_i)^2} = s_2(z). \tag{7.31}$$

The formulas for the Newton and Laguerre methods can be expressed
in terms of the quantities $s_1 = s_1(z)$, $s_2 = s_2(z)$:

$$\text{Newton:} \qquad g(z) = z - \frac{1}{s_1}, \tag{7.32}$$

$$\text{Laguerre:} \qquad g(z) = z - \frac{n}{s_1 \pm \sqrt{(n-1)(n s_2 - s_1^2)}}. \tag{7.33}$$

Therefore, for these methods the roots which already have been computed
can be eliminated by subtracting the appropriate expressions
from $s_1(z)$ and $s_2(z)$, that is, by substituting in the formulas $S_1(z)$ and
$S_2(z)$, where

$$S_1(z) = s_1(z) - \sum_{i=1}^{j} \frac{1}{z - x_i}, \tag{7.34}$$

$$S_2(z) = s_2(z) - \sum_{i=1}^{j} \frac{1}{(z - x_i)^2} \tag{7.35}$$

($j$ being the number of roots computed already).

This procedure gives in many cases more accurate results than syn- 
thetic division. However, it requires appreciably more work for finding 
all the roots, since one works at all times with the original polynomial. 
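
The second procedure can be sketched for Newton's method as follows. The code assumes real, simple roots; the function names, the guard that steps off an already-found root, and the test polynomial are illustrative assumptions. Every step works with the original coefficients, and the roots already found enter only through the correction subtracted from $s_1(z)$.

```python
def poly_and_derivative(a, z):
    """Horner evaluation of f(z) and f'(z); a[0] is the leading coefficient."""
    f, df = a[0], 0.0
    for coeff in a[1:]:
        df = df * z + f
        f = f * z + coeff
    return f, df

def roots_with_implicit_deflation(a, z0=0.0, tol=1e-12, max_iter=200):
    """Newton's method with s_1 corrected for the roots already found."""
    n = len(a) - 1
    roots = []
    for _ in range(n):
        z = z0
        for _ in range(max_iter):
            if any(abs(z - r) <= 1e-9 for r in roots):
                z += 1e-3          # too close to a removed root: step off it
                continue
            f, df = poly_and_derivative(a, z)
            if f == 0.0:
                break              # exact hit on a new root
            # S_1(z) = s_1(z) - sum 1/(z - x_i) over the roots found so far
            big_s1 = df / f - sum(1.0 / (z - r) for r in roots)
            step = 1.0 / big_s1
            z -= step
            if abs(step) <= tol * max(1.0, abs(z)):
                break
        roots.append(z)
    return roots

implicit_roots = roots_with_implicit_deflation([1.0, -10.0, 35.0, -50.0, 24.0])
```

Compared with explicit deflation, the cost per step does not shrink as roots are found, which is the extra work the text mentions.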


(b) Graeffe Method 


Where accuracy is not so important, the so-called Graeffe method 
provides a fast way of computing all the roots of a polynomial. This 
iterative procedure finds approximate values for all the roots at the same 
time, whereas the methods discussed so far determine only one root at a 
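time. It is based on the following relation between the roots $x_i$ ($i = 1,
2, \ldots, n$) and the coefficients $a_i$ ($i = 0, 1, 2, \ldots, n$) of a polynomial of
$n$th degree,

$$P_n(z) = \sum_{j=0}^{n} a_{n-j} z^j = a_0 \prod_{i=1}^{n} (z - x_i),$$

$$\frac{a_1}{a_0} = -\sum_{i=1}^{n} x_i, \qquad \frac{a_2}{a_0} = \sum_{i<j} x_i x_j, \qquad \ldots, \qquad \frac{a_n}{a_0} = (-1)^n x_1 x_2 \cdots x_n. \tag{7.36}$$

So, if the $x_i$ are all real and well separated, that is, if $|x_1| \gg
|x_2| \gg \cdots \gg |x_n|$, then the following relations hold approximately:

$$x_1 \approx -\frac{a_1}{a_0}, \qquad x_2 \approx -\frac{a_2}{a_1}, \qquad \ldots, \qquad x_n \approx -\frac{a_n}{a_{n-1}}. \tag{7.37}$$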


Since this situation does not exist for all polynomials, a method is 
needed which allows new polynomials to be derived whose roots are in a 
simple relation to the ones of the original polynomial and are well separated
in order to take advantage of this observation. The Graeffe method
allows such polynomials to be found. It makes use of the fact that, if
$P_n(z) = \sum_{j=0}^{n} a_{n-j} z^j$ has the roots $x_i$ ($i = 1, 2, \ldots, n$), then

$$P_n^*(z) = (-1)^n \sum_{j=0}^{n} a_{n-j} (-z)^j \tag{7.38}$$

has the roots $-x_i$ ($i = 1, 2, \ldots, n$). Therefore, the polynomial of $n$th
degree in $z^2$ obtained by multiplying $P_n(z)$ and $P_n^*(z)$ together,

$${}_1P_n(z^2) = P_n(z)\,P_n^*(z) = a_0^2 \prod_{i=1}^{n} (z + x_i)(z - x_i) = \sum_{j=0}^{n} {}_1a_{n-j}\,z^{2j}, \tag{7.39}$$

has as its roots $x_i^2$ ($i = 1, 2, \ldots, n$).
The coefficients ${}_1a_j$ of the new polynomial are computed from the $a_j$ of
the old polynomial by the following relations (with $a_m = 0$ for $m > n$):

$${}_1a_j = (-1)^j \Bigl[a_j^2 + 2\sum_{i=1}^{j} (-1)^i a_{j+i}\,a_{j-i}\Bigr].$$

For practical reasons, the factor $(-1)^j$ is usually dropped; that is, one
computes the polynomial with roots $-x_i^2$ instead:

$${}_1a_j = a_j^2 + 2\sum_{i=1}^{j} (-1)^i a_{j+i}\,a_{j-i}, \qquad j = 0, 1, 2, \ldots, n. \tag{7.40}$$

All these coefficients are positive for real roots. By repeated application
of this squaring method, polynomials ${}_kP_n(z) = \sum_{j=0}^{n} {}_ka_{n-j}\,z^j$ can be
obtained, the roots of which are equal to $-x_i^{2^k}$, and the coefficients are
determined recursively by

$${}_ka_j = {}_{k-1}a_j^2 + 2\sum_{i=1}^{j} (-1)^i\,{}_{k-1}a_{j+i}\;{}_{k-1}a_{j-i}. \tag{7.41}$$

If all roots are real and simple, then for a sufficiently large $k$ the roots
of ${}_kP_n(z)$ will be well separated, so that the approximate relations given
before can be used to determine the roots:

$$\log |x_i| \approx 2^{-k} \log \frac{{}_ka_i}{{}_ka_{i-1}}, \qquad i = 1, 2, \ldots, n. \tag{7.42}$$

This situation is realized if the sum of the cross products in the above
formula for the coefficients is small with respect to ${}_{k-1}a_j^2$ ($j = 1, 2, \ldots, n$).
So the ratios

$$\frac{2\sum_{i=1}^{j} (-1)^i\,{}_{k-1}a_{j+i}\;{}_{k-1}a_{j-i}}{{}_{k-1}a_j^2} \tag{7.43}$$

indicate when the iterative procedure of computing polynomials has
produced a polynomial with well-separated roots.
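
The root-squaring recursion can be sketched in a few lines. The version below is an illustration under assumptions: it takes integer coefficients (so that Python's exact integers avoid overflow and round-off in the squaring steps), drops the sign factor as in the text, and returns only the moduli of the roots; the function names and the example polynomial with roots 1, 2, 3, 4 are not from the text.

```python
import math

def graeffe_step(a):
    """One root-squaring step with the sign factor dropped."""
    n = len(a) - 1

    def coeff(m):
        return a[m] if 0 <= m <= n else 0

    squared = []
    for j in range(n + 1):
        total = coeff(j) ** 2
        for i in range(1, j + 1):
            sign = -1 if i % 2 else 1
            total += 2 * sign * coeff(j + i) * coeff(j - i)
        squared.append(total)
    return squared

def graeffe_moduli(a, k):
    """Approximate |x_i| from the ratios of coefficients after k steps."""
    work = list(a)
    for _ in range(k):
        work = graeffe_step(work)
    scale = 2 ** k
    return [math.exp((math.log(work[i]) - math.log(work[i - 1])) / scale)
            for i in range(1, len(work))]

# (z - 1)(z - 2)(z - 3)(z - 4); all roots real and simple.
moduli = graeffe_moduli([1, -10, 35, -50, 24], 6)
```

After six squarings the moduli come out in decreasing order, close to 4, 3, 2, 1. The logarithms require positive coefficients, which holds here because all roots are real.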

The roots are determined only up to the sign by the given logarithmic 
relation. The sign has to be determined by back substitution or other 
methods. In the case of complex or multiple roots, at least some of the 
successively computed coefficients fail to exhibit the described behavior. 
The Graeffe method can be modified so as to take care also of these cases.


(c) Lehmer Method


Instead of discussing these modifications in detail, consideration is 
given here to a method, due to Brodetsky, Smeal, and Lehmer, which 
allows the roots to be computed directly also in these special cases. This 
method amounts to performing two Graeffe methods, one for the polynomial
$P_n(z)$, the other for $P_n(z + h)$, where $h$ is an infinitesimal shift
in the origin. The two processes can be combined at a considerable
saving in labor and will produce the roots (real or complex) with proper
sign.

The formulas for the Lehmer method can be easily derived from those
for the Graeffe method, if one notes that $P_n(z + h)$ is again a polynomial
in $z$, the coefficients of which depend on $h$:

$$P_n(z + h) = \sum_{j=0}^{n} a_{n-j}(h)\,z^j = P_n(z, h). \tag{7.44}$$

$P_n(z, h)$ has $x_i - h$ as roots. Therefore, if the previous root-squaring
procedure is carried out, coefficients ${}_ka_j(h)$ are obtained, which are
functions of $h$. Assuming $h$ to be very small and developing all coefficients
up to the first power in $h$, one obtains from (7.41)

$${}_ka_j(h) = {}_ka_j + h\,{}_kb_j = {}_{k-1}a_j^2 + 2\sum_{i=1}^{j} (-1)^i\,{}_{k-1}a_{j+i}\;{}_{k-1}a_{j-i} + 2h\sum_{i=-j}^{j} (-1)^i\,{}_{k-1}a_{j+i}\;{}_{k-1}b_{j-i}, \tag{7.45}$$

where ${}_kb_j = \frac{d}{dh}\left[{}_ka_j(h)\right]_{h=0}$ can be computed recursively by the formula

$${}_kb_j = 2\sum_{i=-j}^{j} (-1)^i\,{}_{k-1}a_{j+i}\;{}_{k-1}b_{j-i}, \qquad j = 1, \ldots, n; \quad k = 1, 2, \ldots. \tag{7.46}$$

It follows immediately that

$${}_kb_0 = 0 \qquad \text{for all } k, \tag{7.47}$$

since ${}_ka_0(h) = {}_ka_0$. For $k = 0$, the coefficients ${}_0b_j$ are determined from
the relation

$$\frac{d}{dh} P_n(z, h)\Big|_{h=0} = \frac{d}{dh} P_n(z + h)\Big|_{h=0} = \frac{d}{dz} P_n(z);$$

therefore

$${}_0b_j = (n - j + 1)\,a_{j-1}, \qquad j = 1, 2, \ldots, n. \tag{7.48}$$


The ${}_ka_j$ are computed by formula (7.41). With the aid of the ${}_kb_j$, all
the roots can be directly computed. The following cases have to be
distinguished:

1. All the roots are real and simple. Then, for sufficiently large $k$,
the relations (7.37) hold:

$$(x_i - h)^{2^k} = \frac{{}_ka_i(h)}{{}_ka_{i-1}(h)}, \qquad i = 1, 2, \ldots, n,$$

$$x_i^{2^k} - 2^k h\,x_i^{2^k - 1} + \cdots = \frac{{}_ka_i}{{}_ka_{i-1}} + h\,\frac{{}_ka_i}{{}_ka_{i-1}} \left(\frac{{}_kb_i}{{}_ka_i} - \frac{{}_kb_{i-1}}{{}_ka_{i-1}}\right) + \cdots;$$

comparing terms, one obtains

$$x_i^{2^k} = \frac{{}_ka_i}{{}_ka_{i-1}}, \qquad -2^k x_i^{2^k - 1} = \frac{{}_ka_i}{{}_ka_{i-1}} \left(\frac{{}_kb_i}{{}_ka_i} - \frac{{}_kb_{i-1}}{{}_ka_{i-1}}\right), \qquad i = 1, 2, \ldots, n.$$

From this expression one sees that, for numerical purposes, it is more
convenient to define the coefficients by the following set of formulas:

$${}_ka_j = {}_{k-1}a_j^2 + 2\sum_{i=1}^{j} (-1)^i\,{}_{k-1}a_{j+i}\;{}_{k-1}a_{j-i}, \tag{7.49}$$

$${}_kb_j = 2\sum_{i=-j}^{j} (-1)^i\,{}_{k-1}a_{j+i}\;{}_{k-1}b_{j-i}, \qquad j = 1, 2, \ldots, n, \tag{7.50}$$

$${}_kb_0 = 0,$$

and ${}_0a_j = a_j$, $j = 0, 1, \ldots, n$; ${}_0b_j = (n - j + 1)\,a_{j-1}$, $j = 1, 2, \ldots, n$. Then

$$\frac{1}{x_i} = -\frac{1}{2^k} \left(\frac{{}_kb_i}{{}_ka_i} - \frac{{}_kb_{i-1}}{{}_ka_{i-1}}\right), \qquad i = 1, 2, \ldots, n. \tag{7.51}$$

2. The first root has multiplicity $m$; the others are real and simple.
Then, for sufficiently large $k$, the relations (7.36) give

$$m\,(x_1 - h)^{2^k} = \frac{{}_ka_1(h)}{{}_ka_0}, \qquad (x_1 - h)^{m 2^k} = \frac{{}_ka_m(h)}{{}_ka_0}, \qquad (x_1 - h)^{m 2^k} (x_{m+1} - h)^{2^k} = \frac{{}_ka_{m+1}(h)}{{}_ka_0},$$

so that only in ${}_ka_m, \ldots, {}_ka_n$ will the cross products in (7.49) vanish as
described in the case of a simple root.
Applying the same methods as before, one obtains

$$\frac{1}{x_1} = -\frac{1}{m\,2^k}\,\frac{{}_kb_m}{{}_ka_m}, \qquad \frac{1}{x_{m+1}} = -\frac{1}{2^k} \left(\frac{{}_kb_{m+1}}{{}_ka_{m+1}} - \frac{{}_kb_m}{{}_ka_m}\right).$$

Therefore, the formulas have not changed except for the omission of the
expressions for $x_2, \ldots, x_m$.

3. The first two roots are conjugate complex; the others are real and
simple; that is,

$$x_1 = \rho e^{i\varphi}, \qquad x_2 = \rho e^{-i\varphi}.$$

Then, for sufficiently large $k$, one obtains, if the terms of zero order in $h$
are compared,

$$\frac{{}_ka_1}{{}_ka_0} = 2\rho^{2^k} \cos 2^k\varphi + x_3^{2^k} + \cdots + x_n^{2^k} \approx 2\rho^{2^k} \cos 2^k\varphi \qquad \text{if } 2^k\varphi \text{ is not a multiple of } \frac{\pi}{2}, \tag{7.53}$$

$$\frac{{}_ka_2}{{}_ka_0} = \rho^{2^{k+1}}. \tag{7.54}$$

Therefore, the conjugate complex roots cause an oscillation in ${}_ka_1$,
which will no longer stay necessarily positive. The other coefficients ${}_ka_j$
behave as usual. The comparison of the terms of order $h$ gives

$$\frac{{}_kb_2}{{}_ka_0} = -2^{k+1} \rho^{2^{k+1} - 1} \cos\varphi. \tag{7.55}$$

From (7.54), $\rho$ can be determined uniquely, since it is positive, and
from (7.55), one can find $\varphi$, the two possible values of which give $x_1$ and
$x_2$.

Other cases can be similarly discussed. All have in common that, as
soon as all the roots are no longer real and simple, then in some of the
coefficients ${}_ka_j$ the cross products are not becoming negligible in
comparison to the other term for increasing $k$. Further, it can be shown that
the ${}_ka_j$ and ${}_kb_j$ give sufficient information so that the roots can be
computed without any back substitution into the original polynomial.
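
For the simplest case (all roots real and simple) the whole procedure can be sketched as follows. The code rests on assumptions that should be read as such: it takes $P_n(z + h)$ to have the roots $x_i - h$, carries the first-order coefficients ${}_kb_j$ through the differentiated squaring recursion with the factor 2 kept in each step, and recovers each root, sign included, from $1/x_i = -2^{-k}\,({}_kb_i/{}_ka_i - {}_kb_{i-1}/{}_ka_{i-1})$. Integer input is assumed so that exact arithmetic can be used; the function name and the example are illustrative.

```python
def lehmer_real_simple(a, k):
    """Roots (with sign) of a polynomial with real simple roots, k >= 1 steps."""
    n = len(a) - 1
    big_a = list(a)
    # Starting values: b_0 = 0 and b_j = (n - j + 1) * a_{j-1}.
    big_b = [0] + [(n - j + 1) * a[j - 1] for j in range(1, n + 1)]

    def get(seq, m):
        return seq[m] if 0 <= m <= n else 0

    for _ in range(k):
        new_a, new_b = [], []
        for j in range(n + 1):
            total_a, total_b = 0, 0
            for i in range(-j, j + 1):
                sign = -1 if i % 2 else 1
                total_a += sign * get(big_a, j + i) * get(big_a, j - i)
                total_b += 2 * sign * get(big_a, j + i) * get(big_b, j - i)
            new_a.append(total_a)
            new_b.append(total_b)
        big_a, big_b = new_a, new_b

    roots = []
    for i in range(1, n + 1):
        inverse = -(big_b[i] / big_a[i] - big_b[i - 1] / big_a[i - 1]) / 2 ** k
        roots.append(1.0 / inverse)
    return roots

# (z + 3)(z - 1)(z - 5) = z^3 - 3z^2 - 13z + 15
lehmer_roots = lehmer_real_simple([1, -3, -13, 15], 6)
```

The roots come out in order of decreasing modulus, here close to 5, -3 and 1, with the sign of the negative root recovered without back substitution.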

When writing a program for an electronic computer, one has to decide 
how many of these cases one wants to incorporate, the alternative being 
that ${}_ka_j$, ${}_kb_j$ are printed out if only part of them exhibit the desired
behavior for large $k$ and that the further processing is then done manually.
This decision depends somewhat on individual taste. In general,
experience shows that it is not worthwhile to build a lot of sophistication
into a code, because this is usually done at a considerable expense in 
time. 

From the practical point of view, it has also to be mentioned that the 
Graeffe methods are not self-correcting iteration procedures, so that the 
round-off errors accumulate. In some cases it may therefore be neces- 
sary to check the accuracy of the roots obtained by a substitution into 
the original polynomial. From the derivation of the method, it is ob- 
vious that the roots with largest modulus are obtained with the best 
accuracy.


7.4 Determination of the Number of Roots in a Given
Region of the Complex Plane


First the special case is considered where the number of roots in an 
interval of the real axis of a polynomial with real coefficients has to 
be determined. 


(a) The Region Is an Interval on the Real Axis


An economic method to solve this problem makes use of the so-called 
Sturmian sequences and thus avoids the explicit computation of 
the roots. Since the algorithm to compute the particular Sturmian 
sequence for this case can also be used to determine multiple roots 
and thus to reduce any problem of finding roots of a polynomial to find- 
ing simple roots, it is briefly discussed here. 

First it has to be shown how Sturmian sequences can be used for 
solving the stated problem. For this one has to recall the following 
definition. 

Definition. A Sturmian sequence is a sequence of functions
$f_n(z), f_{n-1}(z), \ldots, f_0(z)$ which satisfy on a given interval $[a,b]$ of the real
axis the following conditions:

1. $f_i(z)$ = continuous function ($i = n, n - 1, \ldots, 0$).

2. Sign $f_0(z)$ = constant for $a \le z \le b$.

3. If $f_i(z) = 0$, $f_{i+1}(z)$ and $f_{i-1}(z) \ne 0$ for $a \le z \le b$ and all $i$.

4. If $f_i(z) = 0$, sign $f_{i+1}(z) = -\,$sign $f_{i-1}(z)$ ($i = n - 1, \ldots, 1$).

5. If $z = x$ is a root of $f_n(z)$, then for $h$ sufficiently small,

$$\operatorname{sign} \frac{f_n(x - h)}{f_{n-1}(x - h)} = -1, \qquad \operatorname{sign} \frac{f_n(x + h)}{f_{n-1}(x + h)} = +1.$$

From these properties, the following theorem can be deduced.

Theorem 7.2. The number $m$ of roots of the function $f_n(z)$ on the interval
$[a,b]$ is equal to the difference between the number of changes of sign in the sequences
$f_n(a), f_{n-1}(a), \ldots, f_0(a)$ and $f_n(b), f_{n-1}(b), \ldots, f_0(b)$.

This statement can be verified by following the number of changes in
sign in the sequence as $z$ increases from $a$ to $b$. This number can change
only if one function or several functions go through zero with increasing
$z$. Because of the properties of the Sturmian functions, this number
actually changes only if $f_n(z)$ goes through a zero. This can be easily
shown, since if $f_i(\bar{z}) = 0$ ($i \ne n, 0$), the following situations are possible
according to properties 1, 3, and 4 ($h$ small):

                 f_{i-1}   f_i   f_{i+1}     or     f_{i-1}   f_i   f_{i+1}
    z̄ - h           +       +       -                  -       +       +
    z̄               +       0       -                  -       0       +
    z̄ + h           +       -       -                  -       -       +

In both cases, the number of changes in sign in the sequence remains the
same. According to property 2, $f_0(z)$ cannot cause any such change.
From property 5, however, it follows that this number changes at each
root of $f_n(z)$, which proves the correctness of the given theorem.

Therefore, the number of roots of $f_n(z)$ on $[a,b]$ can be obtained simply
by evaluating a Sturmian sequence $f_n(z), f_{n-1}(z), \ldots, f_0(z)$ for $z = a$
and $z = b$.

It remains to give a method for constructing such a sequence. If one
sets $f_{n-1}(z) = (d/dz) f_n(z)$, then property 5 is satisfied. Using $f_n(z)$ and
$f_{n-1}(z)$, one can generate the rest of the functions by the euclidean
algorithm:

$$f_n(z) = f_{n-1}(z)\,g_{n-1}(z) - f_{n-2}(z),$$
$$f_i(z) = f_{i-1}(z)\,g_{i-1}(z) - f_{i-2}(z), \qquad i = n, n - 1, \ldots, 2, \tag{7.56}$$

where the $g_i(z)$ are linear functions of $z$.

Properties 2, 3, and 4 are easily verified for this sequence, and property 1
follows from the continuity of $f_n(z)$ and $f_{n-1}(z)$ if $f_n(z)$ is a polynomial.
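
The construction and the counting rule of Theorem 7.2 can be sketched together. The code below is an illustration under assumptions: polynomials are lists of coefficients with the leading coefficient first, exact rational arithmetic (`fractions.Fraction`) replaces the floating-point recursion so that sign decisions are reliable, and the function names and test polynomials are not from the text. Each new member of the sequence is the negative remainder of the two preceding ones.

```python
from fractions import Fraction

def poly_eval(p, z):
    value = Fraction(0)
    for coeff in p:
        value = value * z + coeff
    return value

def poly_derivative(p):
    n = len(p) - 1
    return [coeff * (n - i) for i, coeff in enumerate(p[:-1])]

def poly_remainder(num, den):
    """Remainder of num / den (leading coefficients first)."""
    num = list(num)
    while len(num) >= len(den):
        factor = num[0] / den[0]
        for i in range(len(den)):
            num[i] -= factor * den[i]
        num.pop(0)                     # leading term is now zero
    while num and num[0] == 0:
        num.pop(0)
    return num

def sturm_sequence(p):
    p = [Fraction(c) for c in p]
    seq = [p, poly_derivative(p)]
    while True:
        rem = poly_remainder(seq[-2], seq[-1])
        if not rem:
            return seq
        seq.append([-c for c in rem])  # next member is minus the remainder

def sign_changes(seq, z):
    values = [poly_eval(q, z) for q in seq]
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

def count_roots(p, a, b):
    """Number of distinct real roots between a and b (endpoints not roots)."""
    seq = sturm_sequence(p)
    return sign_changes(seq, a) - sign_changes(seq, b)

cubic = [1, -3, -13, 15]          # roots -3, 1, 5
roots_in_0_6 = count_roots(cubic, 0, 6)
```

For the cubic with roots $-3$, $1$, $5$ the count on the interval from 0 to 6 is 2, obtained without computing any root.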


If $f_n(z) = P_n(z) = \sum_{j=0}^{n} a_{n-j} z^j$, the functions in the sequence are
polynomials of decreasing degree:

$$f_i(z) = \sum_{j=0}^{i} a_{i,j}\,z^{i-j}. \tag{7.57}$$

The coefficients are recursively determined by the relations

$$a_{i-1,j} = b_i\,a_{i,j+2} + c_i\,a_{i,j+1} - a_{i+1,j+2}, \qquad j = 0, 1, \ldots, i - 1; \quad i = n - 1, n - 2, \ldots, 1, \tag{7.58}$$

with

$$b_i = \frac{a_{i+1,0}}{a_{i,0}}, \qquad c_i = \frac{1}{a_{i,0}} \left(a_{i+1,1} - b_i\,a_{i,1}\right), \qquad a_{i,i+1} = 0,$$

and $a_{n,j} = a_j$, $a_{n-1,j} = (n - j)\,a_j$, $j = 0, 1, 2, \ldots, n$.

If $P_n(z)$ has multiple roots, then this algorithm does not produce a
complete Sturmian sequence, since some $f_i(z)$ becomes zero, where the $i$
depends on the multiplicity of the roots [$f_{i+1}(z)$ being the greatest
common divisor of $P_n(z)$ and $P_n'(z)$]. So this case has to be treated
separately.

Polynomials with Multiple Roots. If a polynomial has multiple
roots, the problem of finding its roots can be reduced to finding the
roots of some lower-degree polynomials which have only roots with
multiplicity one by applying the euclidean algorithm.

Since the greatest common divisor of $P_n(z)$ and $P_n'(z)$ contains the
roots which have multiplicity $p$ in $P_n(z)$ with multiplicity $p - 1$, a
repeated application of the algorithm described before will generate a
sequence of polynomials which will contain the multiple roots with
decreasing multiplicity. These polynomials will be designated here by
$F_1(z), F_2(z), \ldots, F_k(z)$, where

    $F_1(z)$ = greatest common divisor of $P_n(z)$, $P_n'(z)$,
    $F_2(z)$ = greatest common divisor of $F_1(z)$, $F_1'(z)$,
    . . .
    $F_k(z)$ = greatest common divisor of $F_{k-1}(z)$, $F_{k-1}'(z)$.

Here $k$ is the largest multiplicity occurring among the roots of $P_n(z)$
and $F_j(z)$ contains those roots of $P_n(z)$ which have multiplicity $k, k - 1,
\ldots, j + 1$ ($j = 1, 2, \ldots, k - 1$). [They have in $F_j(z)$ the multiplicities
$k - j, \ldots, 1$.] Therefore, $F_k(z)$ is a constant, and $F_{k-1}(z)$ is a
polynomial which has as simple roots those roots of $P_n(z)$ which have
multiplicity $k$.

$$q_{k-1}(z) = \frac{F_{k-2}(z)}{F_{k-1}(z)^2}$$

is a polynomial which has as simple roots those roots of $P_n(z)$ which have
multiplicity $k - 1$.
In general, if $F_0(z) = P_n(z)$, then, for $j = k - 2, \ldots, 1$,

$$q_j(z) = \frac{F_{j-1}(z)\,F_{j+1}(z)}{F_j(z)^2} \tag{7.59}$$

is a polynomial which has as simple roots those roots of $P_n(z)$ which have
multiplicity $j$. The computation of the $F_j(z)$ is defined by formula
(7.58), completed by the additional rule that, if $a_{i,j} = 0$ for all $j = 0,
1, \ldots, i$, these values have to be replaced by

$$a_{i,j} = (i + 1 - j)\,a_{i+1,j}, \qquad j = 0, 1, \ldots, i.$$

The computation of the $q_j(z)$ can be arranged conveniently in the
following way.

The quotients $F_{j-1}(z)/F_j(z)$ can be directly computed from the $b_i$, $c_i$
of formula (7.58), since, by back substitution in the relations (7.56), it
can be readily shown that they are sums of products of $g_i(z) = b_i z + c_i$.
Here consideration is restricted to finding $F_0(z)/F_1(z)$, since the other
computations follow the same pattern.

Assume that the euclidean algorithm has given $f_k(z)$ as the greatest
common divisor of $P_n(z) = f_n(z)$ and $f_{n-1}(z) = P_n'(z)$. Then, from
(7.56), one obtains

$$\begin{aligned}
f_n(z) &= g_{n-1}(z)\left[g_{n-2}(z) f_{n-2}(z) - f_{n-3}(z)\right] - f_{n-2}(z) \\
&= p_2(z) f_{n-2}(z) - p_1(z) f_{n-3}(z) = p_2(z)\left[g_{n-3}(z) f_{n-3}(z) - f_{n-4}(z)\right] - p_1(z) f_{n-3}(z) \\
&= p_3(z) f_{n-3}(z) - p_2(z) f_{n-4}(z) \\
&\;\;\vdots \\
&= p_{n-k-1}(z) f_{k+1}(z) - p_{n-k-2}(z) f_k(z) \\
&= p_{n-k}(z) f_k(z),
\end{aligned}$$

where $p_{i+1}(z) = p_i(z)\,g_{n-i-1}(z) - p_{i-1}(z)$, $p_1(z) = g_{n-1}(z)$, $p_0 = 1$,
$i = 1, 2, \ldots, n - k - 1$.

If the coefficients of $p_i(z)$ are denoted in the following way,

$$p_i(z) = \sum_{j=0}^{i} \beta_{i,j}\,z^j, \qquad i = 1, 2, \ldots, n - k, \tag{7.60}$$

then they can be computed recursively at the same time as the $a_{i,j}$ of
formula (7.58), using the relations

$$\beta_{i,j} = c_{n-i}\,\beta_{i-1,j} + b_{n-i}\,\beta_{i-1,j-1} - \beta_{i-2,j}, \qquad i = 2, 3, \ldots, n - k; \quad j = 0, 1, \ldots, i, \tag{7.61}$$

with $\beta_{0,0} = 1$, $\beta_{1,0} = c_{n-1}$, $\beta_{1,1} = b_{n-1}$, and $\beta_{i,j} = 0$ for $j > i$ or $j < 0$.

We then have $p_{n-k}(z) = F_0(z)/F_1(z)$.

The other quotients are obtained analogously. Once these quotients
are known, the coefficients of the $q_j(z)$ can be computed recursively,
starting with the coefficient of the highest power from the relations

$$\frac{F_{j-1}(z)}{F_j(z)} = \frac{F_j(z)}{F_{j+1}(z)}\;q_j(z).$$

This method can be used in general to pretreat polynomials which are 
suspected of possessing multiple roots, in order to obtain polynomials 
which have only simple roots. Thus difficulties can be avoided when 
using any of the previously given methods for finding the roots. At the 
same time, one has also all the information available for determining the 
number of roots on a given interval of the real axis. 
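
The simplest instance of this pretreatment is the quotient $F_0(z)/F_1(z)$, which carries every distinct root of $P_n(z)$ exactly once. The sketch below computes it by ordinary polynomial division in exact rational arithmetic rather than through the $b_i$, $c_i$ recursion; that substitution, the function names, and the example are assumptions made for illustration.

```python
from fractions import Fraction

def poly_divmod(num, den):
    """Quotient and remainder of num / den (leading coefficients first)."""
    num = list(num)
    quotient = []
    while len(num) >= len(den):
        factor = num[0] / den[0]
        quotient.append(factor)
        for i in range(len(den)):
            num[i] -= factor * den[i]
        num.pop(0)
    while num and num[0] == 0:
        num.pop(0)
    return quotient, num

def poly_derivative(p):
    n = len(p) - 1
    return [coeff * (n - i) for i, coeff in enumerate(p[:-1])]

def poly_gcd(p, q):
    """Monic greatest common divisor by the euclidean algorithm."""
    while q:
        _, remainder = poly_divmod(p, q)
        p, q = q, remainder
    return [coeff / p[0] for coeff in p]

def simple_root_part(p):
    """Return P / gcd(P, P'), which has each distinct root of P exactly once."""
    p = [Fraction(c) for c in p]
    common = poly_gcd(p, poly_derivative(p))
    quotient, _ = poly_divmod(p, common)
    return quotient

# (z - 1)^2 (z + 2) = z^3 - 3z + 2  ->  (z - 1)(z + 2) = z^2 + z - 2
reduced = simple_root_part([1, 0, -3, 2])
```

Any of the root finders discussed earlier can then be applied to the reduced polynomial without the difficulties caused by the double root.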


(b) General Case 


The general case where one has to determine the number of roots
of $f(z)$ in a certain region of the complex plane is solved by using the
well-known relation that, if $f(z)$ has no poles in the considered region
with boundary $B$,

$$N = \frac{1}{2\pi i} \oint_B \frac{f'(z)}{f(z)}\,dz, \tag{7.62}$$

where $N$ is the number of roots (counted according to their multiplicity)
in the region.

Numerically, this relation can be evaluated in two different ways. 
The first is to use numerical integration. If the boundary $B$ is a polygon,
the integrals over each side can be evaluated by using a suitable numerical
integration formula (e.g., Gauss integration formulas). All other
cases can be solved by approximating the boundary by a polygon. In
selecting the integration method, any poles of the integrand arising from
the existence of simple roots of $f(z)$ on the boundary have to be taken
into account; otherwise the accuracy of the numerical integration rules
may not be satisfactory (see, e.g., [1] or [3]). If such roots exist, then
the real and imaginary parts of $f(z)$, written as functions of the
parameter $t$ used for defining the boundary, must have common factors
which can be found by the euclidean algorithm.
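
A minimal sketch of the first approach follows. It integrates $f'(z)/f(z)$ along each side of a polygon with a composite midpoint rule, which is a simpler stand-in for the Gauss formulas mentioned above; that substitution, the function name, the number of sample points, and the example $f(z) = z^3 - 1$ are assumptions. The vertices are taken in counter-clockwise order and no root is assumed to lie on the boundary.

```python
import math

def count_zeros(f, df, vertices, samples_per_side=2000):
    """Approximate (1 / (2*pi*i)) * contour integral of f'/f over a polygon."""
    total = 0j
    m = len(vertices)
    for s in range(m):
        z_start, z_end = vertices[s], vertices[(s + 1) % m]
        dz = (z_end - z_start) / samples_per_side
        for q in range(samples_per_side):
            z = z_start + (q + 0.5) * dz       # midpoint of the subinterval
            total += df(z) / f(z) * dz
    return total / (2j * math.pi)

def f_cubic(z):
    return z ** 3 - 1.0

def df_cubic(z):
    return 3.0 * z * z

# Small rectangle around the real root z = 1.
small_box = [0.5 - 0.5j, 1.5 - 0.5j, 1.5 + 0.5j, 0.5 + 0.5j]
# Large square containing all three roots.
large_box = [-2 - 2j, 2 - 2j, 2 + 2j, -2 + 2j]
n_small = count_zeros(f_cubic, df_cubic, small_box)
n_large = count_zeros(f_cubic, df_cubic, large_box)
```

The computed values are complex numbers whose real parts lie close to 1 and 3 and whose imaginary parts are close to zero; rounding the real part gives the root count.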

Another method is particularly useful if the number of roots in a half
plane has to be computed. It is based on the following observations.
If one chooses for $B$ a large circle of radius $r$ around the origin, then, with
$z = re^{i\varphi}$,

$$f(z) = \sum_{j=0}^{n} a_{n-j} z^j = \sum_{j=0}^{n} a_{n-j} r^j e^{ij\varphi} \approx a_0 r^n e^{in\varphi}, \tag{7.63}$$

and so (7.62) gives $N = n$. Thus all the roots of $f(z)$ have to lie within
the circle of radius $r$. One can easily see that, for

$$r > \frac{1}{A_0} \sum_{j=1}^{n} A_j, \qquad \text{where } a_j = A_j e^{i\alpha_j} \quad (j = 0, 1, \ldots, n), \tag{7.64}$$

this is certainly true, since if $r > 1$,

$$A_0 > \sum_{j=1}^{n} A_j\,|z|^{-j} \qquad \text{or} \qquad A_0\,|z|^n > \sum_{j=1}^{n} A_j\,|z|^{n-j},$$

so that $|f(z)| > 0$ for $|z| \ge r > 1$.

Therefore, to determine the roots in a half plane bounded by a straight
line $s$, it is sufficient to determine the roots in a region bounded by this
line and a sufficiently large half circle, using the expression (7.62).
Setting $f(z) = Re^{i\psi}$, one can replace the integral in (7.62) by the
expression

$$N = \frac{1}{2\pi} \oint_{\bar{B}} d\psi, \tag{7.65}$$

where $\bar{B}$ is the image of $B$ defined by $f(z)$. From (7.63), one concludes
that the value of the integral over the image of the half circle is $n\pi$. The
contribution from the straight line is equal to $2\pi$ times the number of
turns of the image of the straight line around the origin. The number
of turns can be obtained by counting the number of zeros of the real or
imaginary part of $f(z)$, as $z$ varies on the given straight line. If a zero
occurs for increasing $\psi$, it has to be counted positively; in the other case,
negatively. The number of zeros can be determined by forming a
Sturmian sequence either with $f_n(z) = \mathrm{Re}[f(z)] = R\cos\psi$ and $f_{n-1}(z)
= -\mathrm{Im}[f(z)] = -R\sin\psi$ or with $f_n(z) = \mathrm{Im}[f(z)] = R\sin\psi$ and
$f_{n-1}(z) = \mathrm{Re}[f(z)]$ (depending on which of the two is the higher-degree
polynomial). One can verify immediately that, if one considers the
signs of the functions in these sequences, the sequences lose a change in
sign if $f_n(z)$ goes through zero for increasing $\psi$ and gain a change in sign
if it goes through zero for decreasing $\psi$. Therefore, the difference
between the number $N_1$ of changes in sign between the successive functions
at the initial point of the path of integration and the number $N_2$ of
changes in sign at the other end of the path of integration is equal to
twice the number of turns. So
n N,—N, 
sat cane 
These considerations can easily be extended to the case where the 
number of roots have to be determined in a region bounded by straight 
lines. The described method can also be used for determining approxi- 
mate values of roots by combining it with a method of subdividing into 
smaller and smaller parts the region in which the number of roots have 
to be computed. 
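The counting formula (7.65) can be tried out numerically. The following is only a minimal sketch, not part of the original text: it follows the image of a circle under a polynomial and adds up the increments of the argument $\psi$. The function names `count_roots_in_circle` and `enclosing_radius`, the coefficient ordering (highest power first, matching $a_0, \ldots, a_n$), and the number of sample points are assumptions made for the illustration, and it is assumed that no root lies on the circle itself.

```python
import cmath
import math


def count_roots_in_circle(coeffs, radius, center=0j, samples=4096):
    """Count zeros of a polynomial inside a circle via N = (1/(2*pi)) * (total change of psi).

    coeffs = [a_0, a_1, ..., a_n] with a_0 the coefficient of z**n (assumed ordering).
    """
    def f(z):
        value = 0j
        for a in coeffs:              # Horner's rule
            value = value * z + a
        return value

    total = 0.0
    prev = f(center + radius)
    for k in range(1, samples + 1):
        z = center + radius * cmath.exp(2j * math.pi * k / samples)
        cur = f(z)
        total += cmath.phase(cur / prev)   # increment of psi between neighbouring samples
        prev = cur
    return round(total / (2 * math.pi))


def enclosing_radius(coeffs):
    """A radius satisfying (7.64): r > 1 and r > (1/A_0) * sum of A_j for j >= 1.

    The added 1.0 is an arbitrary safety margin chosen for this sketch.
    """
    a0 = abs(coeffs[0])
    return max(1.0, sum(abs(a) for a in coeffs[1:]) / a0) + 1.0
```

For example, `count_roots_in_circle([1, 0, 0, -1], enclosing_radius([1, 0, 0, -1]))` counts all three cube roots of unity, while a small circle around z = 1 isolates a single one; repeated subdivision of the region, as described above, would narrow the location further.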
Solution of Systems of Nonlinear Equations. In this case, the
problem is to find solutions for a set of m nonlinear equations in m in-
dependent variables:

$$f_i(\zeta_1, \zeta_2, \ldots, \zeta_m) = 0, \quad i = 1, 2, \ldots, m, \tag{7.67}$$

where the $f_i$ are analytic functions of the $\zeta_j$ in the neighborhood of zeros.
For convenience, the $\zeta_j$ will be assumed to be real-valued.

The notation used previously can be taken over, if one understands
now by f(z) a vector with m components $f_i$, depending on the vector
$z = (\zeta_1, \ldots, \zeta_m)$ nonlinearly. Two groups of methods of solution are
considered here: functional iterations and minimizing methods.


Functional Iterations. As in condition 5, one introduces, under analo-
gous assumptions,

$$z = g(z) = z - h(z) f(z), \tag{7.68}$$


which is now a relation between vectors and which splits up into m
equations. The convergence of the iteration

$$z_i = g(z_{i-1}), \quad i = 1, 2, 3, \ldots, \tag{7.69}$$

has to be examined for this new situation. The proof of the convergence
is analogous to the one for just one nonlinear equation. It is convenient
here to introduce as an assumption the Lipschitz condition

$$\max |g(z') - g(z'')| \le k \max |z' - z''| \quad \text{with } k < 1 \tag{7.70}$$

for $z'$, $z''$ in a certain neighborhood N(x) of a solution x of (7.68). By
"max," the largest component of the vectors in absolute value is to be
understood. Then, if $z_0$ is the first vector approximating the solution x
and if $z_1$ is the next approximation defined by (7.69), one has, assuming
$z_0$ in N(x) and using x = g(x),

$$\max |z_1 - x| = \max |g(z_0) - x| \le k \max |z_0 - x|$$

and

$$\max |z_i - x| \le k^i \max |z_0 - x|, \quad i = 1, 2, \ldots.$$

Hence, the $z_i$ converge with increasing i to x under the stated assump-
tions, which are only sufficient conditions to ascertain convergence.
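The contraction argument can be illustrated on a small example. The sketch below is not taken from the text; the particular map `g` (two components, each with partial derivatives bounded by 1/2 in absolute value, so that k = 1/2 in the "max" sense of (7.70)), the starting vector, and the tolerance are all assumptions chosen for the illustration.

```python
import math


def g(z):
    """An assumed example map satisfying (7.70) with k = 1/2 everywhere."""
    z1, z2 = z
    return (0.5 * math.cos(z2), 0.5 * math.sin(z1) + 0.25)


def max_norm(u, v):
    """Largest component of u - v in absolute value (the "max" of the text)."""
    return max(abs(a - b) for a, b in zip(u, v))


def iterate(g, z0, tol=1e-12, max_iter=200):
    """Run z_i = g(z_{i-1}) and record the size of every step."""
    z = z0
    steps = []
    for _ in range(max_iter):
        z_next = g(z)
        steps.append(max_norm(z_next, z))
        z = z_next
        if steps[-1] < tol:
            break
    return z, steps
```

Calling `iterate(g, (0.0, 0.0))` returns an approximate fixed point together with the step sizes, each of which is at most half of the preceding one, in line with the factor k in (7.70).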

An example of such an iteration method is the Newton method gen-
eralized for this case. The formulas are easily derived if f(x) = 0 is
developed into Taylor series around the approximation $z = (\zeta_1, \ldots, \zeta_m)$;
writing $\xi_j$ for the components of x,

$$0 = f(z) + \sum_{j=1}^{m} (\xi_j - \zeta_j) \frac{\partial f(z)}{\partial \zeta_j} + \cdots.$$

If the inverse of the Jacobian $J(z) = [\partial f_i(z)/\partial \zeta_j]$ exists, then, neglecting
the higher-order terms, $x = z - J^{-1}(z) f(z)$. So, in this case

$$g(z) = z - J^{-1}(z) f(z), \tag{7.71}$$

and the iteration is

$$z_{i+1} = z_i - J^{-1}(z_i) f(z_i), \quad i = 0, 1, 2, \ldots. \tag{7.72}$$


It may be rather cumbersome to evaluate the inverse of the Jacobian
$J^{-1}(z_i)$ for each i. It has therefore been suggested that $J^{-1}(z)$ be kept
fixed. Naturally, this will decrease the speed of convergence, but it can
be shown that, if the first guess is sufficiently close to the solution, such a
procedure still converges for a large class of functions $f_i(z)$. In actual
numerical work, considerable care in preserving enough significant digits
when evaluating the right-hand side of (7.72) may be required, since,
particularly if the number of equations is large, the computation of $f(z_i)$
and $J^{-1}(z_i)$ can become rather difficult, owing to losses in significant
digits.
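The difference between the iteration (7.72) and the variant with a fixed Jacobian can be seen on a system of two equations. The following is a hedged sketch under stated assumptions: the system $f_1 = \zeta_1^2 + \zeta_2^2 - 4$, $f_2 = \zeta_1 \zeta_2 - 1$, the starting point, and the function names are invented for the illustration, and the linear system $J d = f$ is solved directly instead of forming $J^{-1}$.

```python
def f(z):
    """Assumed example system: f_1 = x^2 + y^2 - 4, f_2 = x*y - 1."""
    x, y = z
    return (x * x + y * y - 4.0, x * y - 1.0)


def jacobian(z):
    """Jacobian [d f_i / d zeta_j] of the example system."""
    x, y = z
    return ((2.0 * x, 2.0 * y), (y, x))


def solve2(J, b):
    """Solve the 2 x 2 linear system J d = b by Cramer's rule."""
    (a11, a12), (a21, a22) = J
    det = a11 * a22 - a12 * a21
    return ((b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det)


def newton(z0, fixed_jacobian=False, tol=1e-12, max_iter=200):
    """Iteration (7.72); with fixed_jacobian=True the Jacobian at z0 is reused throughout."""
    z = z0
    J0 = jacobian(z0)
    for i in range(1, max_iter + 1):
        J = J0 if fixed_jacobian else jacobian(z)
        step = solve2(J, f(z))
        z = (z[0] - step[0], z[1] - step[1])
        if max(abs(s) for s in step) < tol:
            return z, i
    return z, max_iter
```

Starting both variants from `(2.0, 0.5)`, each reaches the same root; the variant with the fixed Jacobian needs more iterations, which is the loss in speed of convergence mentioned above.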

Other methods, analogous to those given for the single equation, can 
be devised. 

Minimizing Methods. In these methods, the problems (7.67) are re-
placed by the problem of finding the minimum of one function, F(z),
which is so defined that for the solutions of (7.67) it attains a minimum.
Such a function may be defined, for instance, as

$$F(z) = \sum_{i=1}^{m} |f_i(z)|^2 \tag{7.73}$$

or

$$F(z) = \sum_{i=1}^{m} |f_i(z)|. \tag{7.74}$$

In numerical work, the form (7.73) is frequently preferable, since
(7.74) often leads to a very narrow minimum, so that it may be very
difficult to find an initial guess which, when applied with some iterative
method, will lead to a solution. Therefore, only the case (7.73) is con-
sidered here.

Numerical methods for finding the minimum of F(z) can be easily
devised by using the geometrical picture. In the neighborhood of a
solution x, F(z) represents, according to the assumptions, a concave sur-
face. If one starts, therefore, from an initial guess $z_0$ sufficiently close to
x and proceeds in the proper direction $d_0$ along any straight line, except
the one which is tangent to F(z) = constant, one can always get to a point
$z_1$ which is closer to x:

$$z_1 = z_0 + \alpha d_0. \tag{7.75}$$

The closest point to x in the particular direction $d_0$ can be found by
determining the minimum of $F(z_0 + \alpha d_0)$ as a function of $\alpha$; that is, $\alpha$ is
given by the equation

$$\frac{dF(z_0 + \alpha d_0)}{d\alpha} = 0, \tag{7.76}$$

which is, in general, nonlinear in $\alpha$.

This process may be repeated with another direction $d_1$, and so on.
For the directions, the following two choices are most frequently used:

1. The directions $d_i$ are the gradients of F:

$$d_i = \operatorname{grad} F(z_i) \tag{7.77}$$

[where grad $F(z_i)$ means grad F(z) for $z = z_i$]. This is combined with
(7.76),

$$\frac{dF[z_i + \alpha \operatorname{grad} F(z_i)]}{d\alpha} = 0, \tag{7.78}$$

leading to $\alpha_i$, so that

$$z_{i+1} = z_i + \alpha_i \operatorname{grad} F(z_i), \quad i = 0, 1, 2, \ldots. \tag{7.79}$$


This method is called the method of steepest descent. It can be shown
that the $z_i$ converge to x for an appropriately chosen $z_0$, if F(z) is analytic
in a neighborhood of x.

In some cases, it may be rather difficult to evaluate grad $F(z_i)$. In
these situations, it may be more convenient to use directions $d_i$ parallel
to the coordinate axes.

2. The $d_i$ are a set of unit vectors in the directions of a rectangular coordinate
system. With this choice, the method used amounts to solving the set of
generally nonlinear equations in $\alpha$:

$$\frac{\partial F(z + \alpha e_j)}{\partial \zeta_j} = 0, \quad j = 1, 2, 3, \ldots, m \quad [z = (\zeta_1, \ldots, \zeta_m)], \tag{7.80}$$

where $e_j$ is a unit vector in the direction of the coordinate $\zeta_j$, cyclically,
starting with an initial guess $z = z_0$. In each step, one of the equations
(7.80) is solved, giving an $\alpha_i$ and $z_{i+1}$ (i = 0, 1, 2, ...):

$$z_{i+1} = z_i + \alpha_i e_{j_i}, \quad \text{with } j_i = i - \left[\frac{i}{m}\right] m, \tag{7.81}$$

where [i/m] is the largest integer smaller than or equal to the quotient i/m.
One should perhaps note that the nonlinear equations (7.80) are not
identical with the original equations (7.67).

This method is analogous to the Gauss-Seidel method for linear equa- 
tions. The relative simplicity of computations for each of its iterations 
has to be paid for, in general, by a loss in the speed of convergence. 
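Both choices of direction can be sketched in a few lines. What follows is an illustration under stated assumptions, not the procedure of any particular author: the example system ($f_1 = \zeta_1^2 + \zeta_2^2 - 4$, $f_2 = \zeta_1\zeta_2 - 1$), the starting point, the search interval for $\alpha$, and the use of a golden-section search in place of solving (7.76) or (7.80) exactly are all choices made for the example. As in (7.79), the step is written $z + \alpha d$, so that $\alpha$ comes out negative when d is the gradient.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0


def f(z):
    """Assumed example system."""
    x, y = z
    return (x * x + y * y - 4.0, x * y - 1.0)


def F(z):
    """The function (7.73) for real-valued f_i."""
    return sum(fi * fi for fi in f(z))


def grad_F(z):
    x, y = z
    f1, f2 = f(z)
    return (2.0 * (2.0 * x * f1 + y * f2), 2.0 * (2.0 * y * f1 + x * f2))


def line_minimum(phi, lo=-0.5, hi=0.5, tol=1e-12):
    """Golden-section search for the alpha in [lo, hi] minimizing phi (assumed unimodal there)."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = phi(c), phi(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = phi(c)
        else:
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)


def gradient_direction(z, i):
    """Choice 1: d_i = grad F(z_i)."""
    return grad_F(z)


def coordinate_direction(z, i):
    """Choice 2: unit vectors along the coordinate axes, used cyclically."""
    return (1.0, 0.0) if i % 2 == 0 else (0.0, 1.0)


def descend(z, direction_rule, tol=1e-16, max_iter=2000):
    """Repeat z <- z + alpha*d, with alpha minimizing F along d, until F(z) < tol."""
    for i in range(max_iter):
        if F(z) < tol:
            return z, i
        d = direction_rule(z, i)
        alpha = line_minimum(lambda a: F((z[0] + a * d[0], z[1] + a * d[1])))
        z = (z[0] + alpha * d[0], z[1] + alpha * d[1])
    return z, max_iter
```

Calling `descend((2.0, 0.5), gradient_direction)` and `descend((2.0, 0.5), coordinate_direction)` drives F toward zero in both cases; the second variant needs no gradient at all, only evaluations of F along one coordinate at a time.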

Some other methods, like the conjugate-gradient method, have been 
devised for finding the minimum of (7.73) (see [2]). 

It should be pointed out here that in the case of systems of nonlinear
equations, the choice of the initial guess $z_0$ is frequently a difficult prob-
lem, since all the methods described converge only if the initial guess $z_0$
is sufficiently close to the solution. For this reason, it will frequently be
necessary to make a comprehensive tabulation of F(z), which will indi-
cate the behavior of this function.

Finally, one has to admit that the known numerical methods for the 
solution of large systems of nonlinear equations are sometimes far from 
satisfactory. If they converge at all to some values in a reasonable 
amount of time, there remains still the difficult question of the accuracy 
of the answer thus obtained. In most cases, there do not exist any 
practical estimates for the error, so that the only means of checking the 
solutions, at least to a certain extent, is the substitution of the computed 
values into the original equations. 
Therefore, there remain still a number of problems in this field, which
has become more and more important in the past few years, owing to the 
tremendous advances in science and technology. 
8.1 Introduction


Most methods for finding the eigenvalues of a finite matrix lead to 
difficulties when applied to an arbitrary matrix. It is therefore of 
importance to obtain at least estimates for the eigenvalues. Some 
methods for finding the eigenvalues actually depend on the knowledge 
of such estimates. Furthermore, for many practical problems the 
exact eigenvalues are not even required, and bounds will quite often 
suffice. 

Here only three types of bounds are discussed. Several others have 
been developed, some in quite recent years. The bounds to be dis- 
cussed arise from (1) the field of values, (2) the Gersgorin circles, and 
(3) the majorization by nonnegative matrices. No completeness in 
material or bibliography is aimed at. 

We note that the material presented here is becoming of increasing 
interest for numerical analysts. For instance, the theory of nonnegative 
matrices is being applied intensively (see Varga [56]) in the study of 
iterative solutions of the difference equations approximating partial 
differential equations arising from important technological problems. 
Also, the theory of stable matrices, that is, matrices whose eigenvalues 
are all in the left half plane, continues to develop (see Bellman [55], 
Gantmakher [9], Taussky [51, 52], Ostrowski and Schneider [53]). 
THE FIELD OF VALUES
8.2 Definition and Basic Properties 


Let $A = (a_{ik})$ be an $n \times n$ matrix whose elements are complex num-
bers. The field of values F(A) of such a matrix is the set of all numbers

$$\sum_{i,k=1}^{n} a_{ik} \bar{x}_i x_k = (Ax, x) = \bar{x}'Ax$$

where x is a vector $(x_1, \ldots, x_n)$ with $\sum_{i=1}^{n} x_i \bar{x}_i = 1$. This concept was
introduced by Hausdorff [13] and Toeplitz [37], who proved that F(A)
is a bounded, closed, and convex set. That F(A) is bounded and closed
follows, of course, immediately from the fact that it is the image, under a
continuous function, of the unit sphere $\sum_{i=1}^{n} x_i \bar{x}_i = 1$. It can further be
seen at once that F(A) contains the eigenvalues $\lambda_i$ of A. For $Ax = \lambda x$
with $\bar{x}'x = 1$ implies

$$\bar{x}'Ax = \lambda \bar{x}'x = \lambda.$$
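The definition lends itself to a quick numerical experiment. The sketch below is merely illustrative: the function names `field_value` and `sample_field`, the $2 \times 2$ example matrix, and the random sampling of vectors are assumptions made here, and sampling only produces points of F(A), not its exact boundary.

```python
import random


def field_value(A, x):
    """Return conj(x)' A x after scaling x to unit length (one point of F(A))."""
    n = len(x)
    norm_sq = sum(abs(xi) ** 2 for xi in x)
    total = sum(A[i][k] * x[i].conjugate() * x[k] for i in range(n) for k in range(n))
    return total / norm_sq


def sample_field(A, count=1000, seed=0):
    """Evaluate conj(x)' A x for randomly chosen complex vectors x."""
    rng = random.Random(seed)
    n = len(A)
    return [
        field_value(A, [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)])
        for _ in range(count)
    ]
```

For the matrix `A = [[1, 2], [0, 3]]`, the eigenvectors (1, 0) and (1, 1) give `field_value` equal to the eigenvalues 1 and 3, in agreement with the last display, and every sampled value is bounded in absolute value by the sum of the $|a_{ik}|$.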


8.3 The Field of Values of Hermitian Matrices 


Let A be a real symmetric matrix $A = A'$ or a complex, but hermitian,
matrix; that is, $A = \bar{A}' = A^*$. In this case we have

$$\overline{\bar{x}'Ax} = x'\bar{A}\bar{x} = \bar{x}'A^*x = \bar{x}'Ax.$$

(The second equality comes from the fact that we can transpose a scalar.)
Hence F(A) is real for an hermitian A. In particular, the eigenvalues
are real. From the fact that F(A) is closed, bounded, and convex, it
follows then that for an hermitian A it is a closed interval on the real
line. The end points are known to be the largest and smallest eigen-
values, $\lambda_{\max}$ and $\lambda_{\min}$, of A. Since the diagonal elements of A belong to
F(A), we have, in particular,

$$\lambda_{\max} \ge \max_i a_{ii}, \qquad \lambda_{\min} \le \min_i a_{ii}.$$


8.4 The Convex Hull of the Eigenvalues 


The fact that for an hermitian matrix the end points of F(A) are $\lambda_{\max}$
and $\lambda_{\min}$ is a special case of a much more general fact: since F(A) is
convex and contains the eigenvalues $\lambda_1, \ldots, \lambda_n$, it contains the convex
closure C(A) of the $\lambda_i$, that is, the smallest convex and closed set which
includes them. The question arises: Is C(A) = F(A)? If it is, then we
know that the vertices of C(A) will be eigenvalues.

However, in general, $F(A) \ne C(A)$. If, on the other hand, A is
normal, then equality occurs, as is shown below.
8.5 Normal Matrices


A matrix A is called normal when 
AA* = A*A 


where $A^* = \bar{A}'$; for example, hermitian matrices are normal since they
satisfy $A^* = A$. Also, unitary matrices, that is, matrices with $A^{-1} = A^*$,
are normal; in the case that A is real, they are called orthogonal.

One of the most important applications of unitary matrices is pro-
vided by the fact that every matrix A can be transformed to upper tri-
angular form by a unitary similarity; that is, a unitary matrix U exists
such that

$$U^{-1}AU = (b_{ik}),$$

with $b_{ik} = 0$ when i > k. (For a proof of this result see Schwerdtfeger
[54, p. 203]; note that the matrix $B = (b_{ik})$ is not unique.) Since U is unitary,
this implies that $U^{-1}A^*U = (b_{ik})^* = (\bar{b}_{ki})$.

If, further, A is normal, then $U^{-1}AU$ is normal too, which implies that

$$(b_{ik})(b_{ik})^* = (b_{ik})^*(b_{ik}).$$

Equating the diagonal elements of the two products above, we obtain

$$b_{ik} = 0, \quad i \ne k.$$

This implies that every normal matrix can be transformed by a unitary
similarity into a diagonal matrix. Since similar matrices have the same
eigenvalues, this diagonal matrix consists of the eigenvalues of the
normal matrix.


8.6 Invariance of the Field of Values 


Another property of F(A) which is used to prove that F(A) coincides
with C(A) for a normal A is the fact that, for a unitary U,

$$F(A) = F(U^{-1}AU).$$

This follows immediately from the definition of F(A) as the set of num-
bers $\bar{x}'Ax$ for $\bar{x}'x = 1$; replacing A by $U^{-1}AU$, we obtain $\bar{x}'U^{-1}AUx$,
which belongs to F(A), since $\bar{x}'U^{-1} = (\overline{Ux})'$ and $\bar{x}'U^{-1}Ux = 1$ for
$\bar{x}'x = 1$. On the other hand, every number $\bar{x}'Ax$ can be written in the form
$(\bar{x}'U)(U^{-1}AU)(U^{-1}x) = (\overline{U^{-1}x})'(U^{-1}AU)(U^{-1}x)$, and again
$\bar{x}'UU^{-1}x = 1$ for $\bar{x}'x = 1$.


8.7 The Field of Values for Normal Matrices 


Hence we need to prove the property F(A) = C(A) for A normal only
for the diagonal matrix diag $(\lambda_1, \ldots, \lambda_n)$ formed by the eigenvalues of
A. For such a diagonal matrix the field of values is

$$\sum_{i=1}^{n} \lambda_i x_i \bar{x}_i.$$

Since $x_i \bar{x}_i \ge 0$ and $\sum x_i \bar{x}_i = 1$, it is clear that F(A) is the convex closure
of the $\lambda_i$'s.

Normal matrices are not the only ones for which F(A) = C(A) unless
$n \le 4$. For all n > 4, there are nonnormal matrices with this prop-
erty (see Moyls and Marcus [21]).


8.8 Generalizations of the Field of Values 


The following generalization of the definition of F(A) was suggested
by Givens [12]:

$$F_H(A) = x^*HAx \quad \text{for } x^*Hx = 1$$

where H is a positive definite matrix. It can then be shown that C(A)
is the intersection of all $F_H(A)$ for all possible choices of H.
Another generalization was suggested by K. Fan (unpublished): let
$F_r(A)$ be the set of all numbers $\sum_{i=1}^{r} (Ax^i, x^i)$ when the set $x^i$ varies over all
systems of r orthonormal vectors.


8.9 The Field of Values of Sums and Products 


It is easy to see that the field of values of A + B is contained in the set
of numbers F(A) + F(B), where $S_1 + S_2$ means here the set of all numbers
$\sigma_1 + \sigma_2$ when $\sigma_1 \in S_1$ and $\sigma_2 \in S_2$. Similarly the field of values of AB is
contained in F(A)F(B) for an analogous definition.


8.10 Singular Values of Matrices 


If $\lambda_i$ are the eigenvalues of A, then $A^*$ has as eigenvalues $\bar{\lambda}_i$. The
eigenvalues of $AA^*$ are in general not $\lambda_i \bar{\lambda}_i$. For normal matrices they
are $\lambda_i \bar{\lambda}_i$, and conversely (see Parker [27], Hoffman and Taussky
[17]). This can be shown by the methods used in Sec. 8.3. The
positive square roots of the eigenvalues of $AA^*$ are called the singular
values of A. The following inequalities hold between the eigenvalues
and the singular values (see Browne [6], Brauer [4], and Weyl [38]):

$$\lambda_{\min}(AA^*) \le |\lambda_i(A)|^2 \le \lambda_{\max}(AA^*).$$

These inequalities matter very much, for it is easier to find bounds for
the eigenvalues of the hermitian matrix $AA^*$ than for the general matrix
A. The inequalities can be proved easily, again by transforming A to
triangular form by a unitary similarity.
8.11 Application of the Field of Values to Bounds of
Eigenvalues 


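The exact form of the set F(A) is complicated and not of immediate
computational use (see Murnaghan [22]).
However, useful bounds can be obtained from F(A) with little trouble:

$$\left|\sum a_{ik} \bar{x}_i x_k\right| \le \sum |a_{ik}| \quad \text{for } \sum x_i \bar{x}_i = 1;$$

hence

$$|\lambda_{\max}| \le \sum |a_{ik}|.$$

This is a very crude estimate. A better bound,

$$|\lambda_{\max}| \le \left(\sum |a_{ik}|^2\right)^{1/2},$$

can be obtained by applying the Schwarz inequality as follows:

$$\left|\sum a_{ik} \bar{x}_i x_k\right|^2 \le \sum |a_{ik}|^2 \sum |\bar{x}_i x_k|^2 = \sum |a_{ik}|^2 \sum |x_i|^2 \sum |x_k|^2 = \sum |a_{ik}|^2.$$

Using F(A), we also obtain

$$\left|\sum_{i,k=1}^{n} a_{ik} \bar{x}_i x_k\right| \le \sum_{i,k=1}^{n} |a_{ik}| \, |\bar{x}_i x_k| \le \max |a_{ik}| \sum_{i,k=1}^{n} |x_i| \, |x_k| = \max |a_{ik}| \sum_{i=1}^{n} |x_i| \sum_{k=1}^{n} |x_k| \le \max |a_{ik}| \, n.$$

The last inequality can be obtained by applying the Schwarz inequality
$(\sum a_i b_i)^2 \le \sum a_i^2 \sum b_i^2$ to the sets $a_i = |x_i|$, $b_i = 1$. This implies
$\sum |x_i| \le \sqrt{n}$. Hence

$$|\lambda_{\max}| \le \max |a_{ik}| \, n.$$

Bounds can also be obtained for the real and imaginary parts of the
eigenvalues. Observe that

$$\mathrm{Re}\left(\sum_{i,k=1}^{n} a_{ik} \bar{x}_i x_k\right) = \tfrac{1}{2} \sum_{i,k=1}^{n} (a_{ik} + \bar{a}_{ki}) \bar{x}_i x_k$$

and apply the above estimates to the matrix $\tfrac{1}{2}(A + A^*)$.
It follows that

$$|\mathrm{Re}(\lambda_i)| \le \max \tfrac{1}{2} |a_{ik} + \bar{a}_{ki}| \, n.$$

By a similar argument

$$|\mathrm{Im}(\lambda_i)| \le \max \tfrac{1}{2} |a_{ik} - \bar{a}_{ki}| \, n$$

is obtained. If $a_{ik} = \bar{a}_{ki}$, then Im $(\lambda_i) = 0$ follows, which is a well-
known fact for hermitian matrices. If $a_{ik} = -\bar{a}_{ki}$, then Re $(\lambda_i) = 0$
follows, which is also well known for skew hermitian matrices.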




The bounds for $|\lambda_i|$ and $|\mathrm{Re}(\lambda_i)|$ are best possible, as is seen by taking
as A a matrix all of whose elements are 1. However, for real matrices
the bound for $|\mathrm{Im}(\lambda_i)|$ can be improved to

$$|\mathrm{Im}(\lambda_i)| \le \max \tfrac{1}{2} |a_{ik} - a_{ki}| \cot \frac{\pi}{2n}.$$

This bound is best possible, as can be seen by taking as A the matrix

$$\begin{pmatrix} 0 & 1 & \cdots & 1 \\ -1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & 0 \end{pmatrix}$$

whose characteristic equation is $(x + 1)^n + (x - 1)^n = 0$ and whose
eigenvalues are $i \cot [(2k - 1)\pi/2n]$. These last bounds have been
developed by Hirsch [16], Bendixson [2], and Pick [28].
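The extremal example can be checked numerically for a small n. The sketch below is only an illustration: the function names, the choice n = 4 in the usage note, and the determinant routine (plain Gaussian elimination with partial pivoting) are assumptions introduced here.

```python
import math


def pick_matrix(n):
    """The real matrix with 0 on the diagonal, +1 above it and -1 below it."""
    return [[0 if i == k else (1 if k > i else -1) for k in range(n)] for i in range(n)]


def pick_bound(A):
    """max (1/2)|a_ik - a_ki| * cot(pi / 2n) for a real matrix A."""
    n = len(A)
    m = max(abs(A[i][k] - A[k][i]) / 2 for i in range(n) for k in range(n))
    return m / math.tan(math.pi / (2 * n))


def claimed_eigenvalues(n):
    """The values i*cot((2k - 1)*pi / 2n), k = 1, ..., n, quoted in the text."""
    return [1j / math.tan((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]


def characteristic_equation(x, n):
    return (x + 1) ** n + (x - 1) ** n


def determinant(M):
    """Determinant by Gaussian elimination with partial pivoting (complex entries allowed)."""
    M = [list(row) for row in M]
    n = len(M)
    result = 1
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if M[p][c] == 0:
            return 0
        if p != c:
            M[c], M[p] = M[p], M[c]
            result = -result
        result *= M[c][c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= factor * M[c][k]
    return result


def det_shifted(A, x):
    """det(A - x*I)."""
    n = len(A)
    return determinant([[A[i][k] - (x if i == k else 0) for k in range(n)] for i in range(n)])
```

With `n = 4`, each value from `claimed_eigenvalues(4)` makes both `characteristic_equation` and `det_shifted(pick_matrix(4), x)` vanish up to rounding, and the largest imaginary part coincides with `pick_bound(pick_matrix(4))`, so the bound is attained.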


8.12 The Norm of a Matrix 


Although F(A) is invariant under unitary similarity transformations,
these last bounds are not invariant. An invariant bound mentioned
earlier can also be found by using the following inequality due to Schur
[32]:

$$\sum_{i=1}^{n} |\lambda_i|^2 \le \sum_{i,k=1}^{n} |a_{ik}|^2.$$

This implies immediately the inequality

$$|\lambda_{\max}| \le \left(\sum |a_{ik}|^2\right)^{1/2}.$$

The quantity $(\sum |a_{ik}|^2)^{1/2}$ is called norm A. To show that it is invariant
under unitary transformations of A, use the fact that (norm A)$^2$ =
tr $(AA^*)$; since further tr $[(U^{-1}AU)(U^{-1}AU)^*]$ = tr $(U^{-1}AA^*U)$ and
the trace of a matrix is invariant under even arbitrary similarity trans-
formations, the invariance of norm A is established.

Since norm A is invariant under unitary transformations, the in-
equality of Schur can be proved by assuming A in triangular form.
This proof further exhibits that equality holds if and only if the tri-
angular matrix is actually diagonal. This means for the original matrix
that equality holds in the Schur inequality if and only if A is normal.
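For $2 \times 2$ matrices both sides of the Schur inequality can be computed directly. The following sketch is an assumption-laden illustration: it obtains the eigenvalues from the quadratic formula, which is adequate only for this tiny case, and the function names and example matrices are chosen here.

```python
import cmath


def eigenvalues_2x2(A):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    return ((tr + root) / 2, (tr - root) / 2)


def schur_sides(A):
    """Return (sum of |lambda_i|^2, sum of |a_ik|^2), the two sides of the Schur inequality."""
    lhs = sum(abs(lam) ** 2 for lam in eigenvalues_2x2(A))
    rhs = sum(abs(x) ** 2 for row in A for x in row)
    return lhs, rhs
```

The triangular, nonnormal matrix `[[1, 2], [0, 3]]` gives the strict inequality 10 < 14, while the normal matrix `[[1, -2], [2, 1]]` gives equality, as the last paragraph asserts.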


8.13 Row and Column Sums 


It can be shown that

$$\max_i \sum_{k=1}^{n} |a_{ik}| \quad \text{and} \quad \max_k \sum_{i=1}^{n} |a_{ik}|$$

are bounds for the absolute values of the eigenvalues of A. Even these
bounds can be replaced by a bound which is, in general, better. This
is done in Sec. 8.15.


THE GERSGORIN CIRCLES 


8.14 A Determinant Theorem: Matrices with Dominant 
Diagonal 
The bounds announced will be obtained easily from a simple
theorem which has turned up again and again in totally different
branches of mathematics (for a bibliography, see Taussky [34]).
It deals with so-called "matrices with dominant diagonal," that is,
matrices $A = (a_{ik})$ for which

$$|a_{ii}| > \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|, \quad i = 1, \ldots, n.$$

Such matrices play a big role in computational problems, since they
are not too remote from diagonal matrices. Many processes for finding
the inverse of a matrix work particularly well for such matrices.

The determinant theorem in question states that a matrix with
dominant diagonal has an inverse. This theorem can be generalized
if we generalize the concept of dominant diagonal by including also
the possibility of equality in the above relations. However, we must
obviously exclude the case that equality holds for all n equations.
It has further to be assumed that the matrix is indecomposable*, that is,
that it cannot be brought into the form

$$\begin{pmatrix} P & Q \\ O & R \end{pmatrix}$$

by a simultaneous row and column permutation. Here P and R are square
matrices, and O consists of zeros only. The theorem then finally becomes:
An indecomposable matrix with a dominant diagonal in the generalized
sense, for which not all the relations

$$|a_{ii}| = \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|, \quad i = 1, \ldots, n,$$

hold, has an inverse.


8.15 Application to Eigenvalues 


If we apply the determinant theorem to an arbitrary matrix $A =
(a_{ik})$, it follows that as long as the matrix $A - xI$ has a dominant
diagonal, then it must also have an inverse; that is, if

$$|a_{ii} - x| > \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|,$$

then x is not an eigenvalue. This implies that all the eigenvalues of A
lie inside or on the boundary of the n circles:

$$|a_{ii} - x| \le \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|.$$

Applying the generalized concept of dominant diagonal, we obtain the
following theorem:

For an indecomposable matrix all eigenvalues lie inside the union of the above
circles unless an eigenvalue is a common boundary point of all n circles.

The importance of these circles was first mentioned by Gersgorin
[11]; later they were rediscovered by Brauer [3].

* The term irreducible is also used.
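The circles are simple to compute. The sketch below is an illustration only: the function names, the slack used in the comparisons, and the $2 \times 2$ test matrices (whose eigenvalues are taken from the quadratic formula) are assumptions made for the example.

```python
import cmath


def gersgorin_discs(A):
    """Return (center a_ii, radius sum over k != i of |a_ik|) for every row i."""
    n = len(A)
    return [(A[i][i], sum(abs(A[i][k]) for k in range(n) if k != i)) for i in range(n)]


def in_union(x, discs, slack=1e-12):
    """True if x lies inside or on the boundary of at least one circle."""
    return any(abs(x - c) <= r + slack for c, r in discs)


def has_dominant_diagonal(A):
    """True if |a_ii| exceeds the radius of the ith circle for every i."""
    return all(abs(c) > r for c, r in gersgorin_discs(A))


def eigenvalues_2x2(A):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    return ((tr + root) / 2, (tr - root) / 2)
```

For `A = [[4, 1], [1, 2]]` the circles have centers 4 and 2 and radius 1 each; both eigenvalues, $3 \pm \sqrt{2}$, fall inside them, and since the diagonal is dominant the matrix is nonsingular, as the determinant theorem requires.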


8.16 Generalization of the Determinant Theorem to the 
Study of the Rank of a Matrix 


Another formulation of the determinant theorem is obviously the
following:
Let A be an indecomposable matrix for which not all

$$|a_{ii}| = \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|, \quad i = 1, \ldots, n,$$

and let rank $A \le n - 1$; then at least one inequality

$$|a_{ii}| < \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|$$

must hold.
A generalization of this was given by Taussky [35] and Stein [33]:
Let A be an indecomposable matrix for which not all

$$|a_{ii}| = \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|, \quad i = 1, \ldots, n.$$

If rank $A \le n - m$, then at least m inequalities

$$|a_{ii}| < \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|$$

hold.
8.17 Application to Multiple Eigenvalues of a Matrix 


From the preceding theorem it follows immediately that for an 
indecomposable matrix an eigenvalue of multiplicity m with m linearly 
independent eigenvectors must be contained in at least m of the Gers- 
gorin circles. 
8.18 Disconnected Sets of Circles


Gersgorin showed that, if a set of $n_1$ (< n) circles has no point in
common with the remaining $(n - n_1)$ circles, then this set contains
exactly $n_1$ eigenvalues (multiple eigenvalues being counted with their
proper multiplicities). He showed this by a continuity argument which
is repeated here for the special case that $n_1 = 1$. The general case
follows in exactly the same manner.

Assume that the circle in question corresponds to the ith row.
Construct then a new matrix $A'$ in which the ith row is replaced by
$(0 \cdots 0 \; a_{ii} \; 0 \cdots 0)$. This matrix has $a_{ii}$ as an eigenvalue, and the
remaining n - 1 eigenvalues come from the $(n - 1) \times (n - 1)$
matrix which is obtained if we omit the ith row and column. Clearly
the Gersgorin circles of this matrix are contained in the (n - 1)
remaining circles of the original matrix and hence have no point in
common with the ith circle. Use now the fact that the eigenvalues
vary continuously with the elements of the matrix. We go back from
$A'$ to A in a continuous transition by increasing the absolute values of
the elements in the ith row, but so that they do not exceed the original
$|a_{ik}|$ at any time. It is then clear that the eigenvalue moving away
from $a_{ii}$ cannot leave the original ith circle and that the other (n - 1)
eigenvalues must stay inside the other (n - 1) circles.

A special case arises when all circles are disconnected. If this 
happens to a real matrix, then all eigenvalues must be real. For the 
complex eigenvalues would have to lie symmetrically about the real 
axis; since the centers of the circles are on the real axis, the corre- 
sponding circle would have to contain two eigenvalues, which is a 
contradiction. 


8.19 Real Matrices with Dominant Diagonal 


It can be shown that a real matrix with dominant diagonal and 
positive diagonal elements not only is nonsingular but even has a 
positive determinant. Various lower bounds for the determinant of 
such a matrix have been given (see Ostrowski [23], Price [29], Brenner 
[5], Haynsworth [14], Schneider [30]). 

The Gersgorin circles of such a matrix lie entirely on the right of the 
imaginary axis; hence real matrices with dominant and positive di- 
agonal have all their eigenvalues with positive real parts. 


8.20 Eigenvalues of Similar Matrices 


Since similar matrices have the same eigenvalues but not the same
circles, it is possible to obtain smaller regions inside which the eigenvalues
lie. For if we consider all matrices $S^{-1}AS$ for all possible nonsingular
matrices S, then the eigenvalues of A must lie in the intersection of all
the circle regions obtained. If A is in particular similar to a diagonal
matrix, then the eigenvalues themselves, considered as point circles,
are a special case of such a circle region.

For practical computations it is particularly helpful to use as S
the matrices diag $(1, \ldots, 1, \alpha, 1, \ldots, 1)$. Such an S does not change
the diagonal elements but can be used to decrease the radii of the circles.
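The effect of such a diagonal similarity is easy to see on a small example. The sketch below is illustrative only; the function names, the $3 \times 3$ matrix, and the value $\alpha = 5$ are assumptions chosen to make the shrinking of one circle visible.

```python
def similarity_scale(A, j, alpha):
    """Return S^{-1} A S for S = diag(1, ..., 1, alpha, 1, ..., 1), alpha in position j."""
    n = len(A)
    s = [alpha if i == j else 1.0 for i in range(n)]
    return [[A[i][k] * s[k] / s[i] for k in range(n)] for i in range(n)]


def radii(A):
    """Row radii: sum over k != i of |a_ik|."""
    n = len(A)
    return [sum(abs(A[i][k]) for k in range(n) if k != i) for i in range(n)]
```

For `A = [[1.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 3.0]]`, the call `similarity_scale(A, 0, 5.0)` leaves the diagonal unchanged, reduces the radius of the first circle from 0.2 to 0.04, and enlarges the other two radii to 0.6. The first circle remains disjoint from the others, so by Sec. 8.18 it contains exactly one eigenvalue, now located to within 0.04 of 1.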


8.21 Other Circle Sets 


Since the transpose of a matrix has the same eigenvalues as the
original matrix, the columns of the matrix can be used instead of the
rows. We denote the radii derived from the rows by $r_i$, that is,

$$r_i = \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}|,$$

and the radii derived from the columns by $c_i$, that is,

$$c_i = \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ki}|.$$

It was shown by Ostrowski [25] that the eigenvalues of A also lie in the
circles with centers $a_{ii}$ and radii $r_i^{\alpha} c_i^{1-\alpha}$, for all $\alpha$ with $0 \le \alpha \le 1$.
Other regions which contain the eigenvalues were studied by 
Schneider [31] by using the fact that a determinant vanishes simul- 
taneously with the determinants obtained by permuting the rows. 
Applying this to a characteristic determinant and using the determinant 
theorems of Sec. 8.14, a region which contains the eigenvalues is 
obtained which consists partly of the interior of circles, and partly of 
the exterior. This treatment provides a generalization of the fact, 
observed by Taussky [36] for n = 2, that in this case no eigenvalue can 
lie in the common part of the two Gersgorin circles. For, by Schneider’s 
remark, the eigenvalues for n = 2 have also to lie in the union of the 
exterior of the same two circles, hence cannot lie in the common part. 


8.22 Cassini Ovals 


The following generalization of the determinant theorem of Sec. 8.14
holds:
Let

$$|a_{ii}| \, |a_{kk}| > \sum_{\substack{r=1 \\ r \ne i}}^{n} |a_{ir}| \sum_{\substack{r=1 \\ r \ne k}}^{n} |a_{kr}|$$

hold for all i, k = 1, ..., n; $i \ne k$. Then A is nonsingular.

This theorem was found by Ostrowski [24] and rediscovered by
Brauer [4], who further utilized it in the same manner as Gersgorin
applied the determinant theorem of Sec. 8.14 to obtain the circles. In
this way it is shown that the $\binom{n}{2}$ Cassini ovals

$$|a_{ii} - z| \, |a_{kk} - z| \le \sum_{\substack{r=1 \\ r \ne i}}^{n} |a_{ir}| \sum_{\substack{r=1 \\ r \ne k}}^{n} |a_{kr}|$$

form a region inside which all the eigenvalues must lie.

It was pointed out to the author by J. L. Brenner (unpublished) that 
this argument cannot, in general, be generalized to three or more 
factors. 
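A membership test for the region formed by the Cassini ovals takes only a few lines. The sketch below is an illustration under stated assumptions: the function names, the slack in the comparisons, and the tridiagonal example matrix (whose eigenvalues $2$ and $2 \pm \sqrt{2}$ are known in closed form) are chosen here.

```python
from itertools import combinations


def row_radii(A):
    """Sum over r != i of |a_ir| for every row i."""
    n = len(A)
    return [sum(abs(A[i][r]) for r in range(n) if r != i) for i in range(n)]


def in_cassini_region(z, A, slack=1e-12):
    """True if z lies in at least one oval |a_ii - z| |a_kk - z| <= r_i r_k with i != k."""
    radii = row_radii(A)
    return any(
        abs(A[i][i] - z) * abs(A[k][k] - z) <= radii[i] * radii[k] + slack
        for i, k in combinations(range(len(A)), 2)
    )


def in_gersgorin_region(z, A, slack=1e-12):
    """True if z lies in at least one circle |a_ii - z| <= r_i."""
    radii = row_radii(A)
    return any(abs(A[i][i] - z) <= radii[i] + slack for i in range(len(A)))
```

For `A = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]` all three eigenvalues pass `in_cassini_region`, whereas the point 3.9 lies in a Gersgorin circle but in none of the ovals, so the ovals give the sharper region in this case.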


NONNEGATIVE MATRICES 


In the Gersgorin circles, as in most work on bounds for the eigen- 
values, the absolute values of the elements of the matrix play a bigger 
role than the elements. This idea is now exploited in more detail. 
See also Fan [46]. 

8.23 Majorization 

If A is an arbitrary $m \times n$ matrix $(a_{ik})$ with complex elements and
$B = (b_{ik})$ is another $m \times n$ matrix, such that $b_{ik} \ge |a_{ik}|$, then we say
that B majorizes A, and we write $B \ge A$. In particular, $B \ge 0$ means
that all elements of B are nonnegative, in which case the matrix B is
called nonnegative. By $B > 0$ we mean that all elements of B are
positive, in which case the matrix B is called positive.

If $B \ge A$ and m = n, then the absolute values of all the eigenvalues
of A are at most as large as the absolute value of the maximum eigenvalue
of B (which is actually itself a nonnegative number, as is shown below).
This fact can be proved by using the well-known majorizing of power
series; namely, if $|\alpha_i| \le \beta_i$, i = 1, 2, ..., then the radius of convergence
of the power series $\sum \alpha_i x^i$ is at least as large as the radius of convergence
of the power series $\sum \beta_i x^i$. Consider then the power series with matrix
coefficients $(I - xA)^{-1}$ and $(I - xB)^{-1}$. It is known (see, e.g., Mac-
Duffee [20], p. 98) that the radii of convergence of these series are
$1/|\lambda_{\max}(A)|$, $1/\lambda_{\max}(B)$, respectively, which proves the assertion.

If some $b_{ik} > |a_{ik}|$, then it cannot be concluded, in general, that
$\lambda_{\max}(B) > |\lambda_{\max}(A)|$, as is seen by the example


t=f6  ®-( o 
However, for indecomposable matrices (see Sec. 8.14), the inequalities

$$b_{ik} \ge |a_{ik}|,$$

with strict inequality in at least one case, imply that

$$\lambda_{\max}(B) > |\lambda_{\max}(A)|.$$
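The majorization statement can be checked on $2 \times 2$ matrices. The following is a small sketch under stated assumptions: the eigenvalues come from the quadratic formula, and the function names and the example pair A, B are invented for the illustration.

```python
import cmath


def eigenvalues_2x2(A):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    return ((tr + root) / 2, (tr - root) / 2)


def spectral_radius(A):
    """Largest absolute value of an eigenvalue."""
    return max(abs(lam) for lam in eigenvalues_2x2(A))


def majorizes(B, A):
    """True if b_ik >= |a_ik| for all i, k."""
    return all(B[i][k] >= abs(A[i][k]) for i in range(len(A)) for k in range(len(A[0])))
```

With `A = [[1, -2], [1j, 1]]` and `B = [[1, 2], [1, 1]]`, B majorizes A; the eigenvalues of A have absolute values $\sqrt{5}$ and 1, both below the maximum eigenvalue $1 + \sqrt{2}$ of B.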


8.24 Primitive Matrices 


An indecomposable matrix A which is nonnegative and has only one
eigenvalue of maximal absolute value is called primitive. This definition
can be shown to be equivalent to asking that $A^m > 0$ for some value
$m = m_0$ (see, e.g., Herstein [15], Ptak and Sedlacek [57]). A positive
matrix is by definition indecomposable.


8.25 The Fundamental Theorem Concerning Nonnegative 
Indecomposable Matrices 


The main usefulness of positive and nonnegative matrices lies in the 
fact that important facts are known about their eigenvalues and 
vectors. A nonnegative indecomposable matrix has among its eigen- 
values of maximal absolute value one which is real and positive. This 
eigenvalue is simple, and its corresponding eigenvector can be chosen 
to have positive components. No other eigenvalue has an eigenvector 
with positive (or even nonnegative) components. 

The theorem concerning nonnegative indecomposable matrices goes 
back to Perron and Frobenius and can be proved in various ways. 
The existence of a positive eigenvalue with positive eigenvector for a 
positive matrix follows easily from the Brouwer fixed-point theorem 
(see Alexandroff and Hopf [1], p. 480). Here a proof will be given 
which follows rather closely Wielandt’s treatment [39]. For other 
proofs see Brauer [44], Fan [45], and Householder [47]. 

(a) The main tool in this treatment is to assign to every vector
$x = (x_1, \ldots, x_n)$ with all $x_i \ge 0$ (but at least one $x_i > 0$) the number
$r_x$ defined in the following way:

$$r_x = \min_i \frac{\sum_{k=1}^{n} a_{ik} x_k}{x_i}.$$

If $x_i = 0$, then the quotient is defined to have the value $+\infty$. This
number is also definable as the largest number $\rho$ for which

$$Ax - \rho x \ge 0.$$

It is clear that at least one component of $Ax - r_x x$ is zero; our aim is
to find an x for which all are zero. It is easy to see that the set of
numbers $r_x$, when x varies over all vectors described above, contains
positive numbers and also that it is bounded. It can be shown that

$$r_x \le \text{max column sum}.$$

For if we let s be the vector (1, 1, ..., 1), then

$$Ax - r_x x \ge 0$$

implies

$$s'Ax - r_x s'x \ge 0$$

or

$$r_x \le \frac{s'Ax}{s'x} \le \text{max column sum}.$$

(No column sum is zero, since the matrix is indecomposable.)
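The number $r_x$ and its bound by the largest column sum can be computed directly. The sketch below is an illustration and not part of the proof: the function names and the example matrix are assumptions, and the repeated replacement of x by Ax (rescaled to unit sum) is a device added here, relying on the fact that $Ax - r_x x \ge 0$ implies $A(Ax) - r_x(Ax) \ge 0$ for nonnegative A, so that $r_x$ cannot decrease.

```python
def r_value(A, x):
    """r_x = min over i of (Ax)_i / x_i; components with x_i = 0 count as +infinity."""
    n = len(A)
    Ax = [sum(A[i][k] * x[k] for k in range(n)) for i in range(n)]
    return min(Ax[i] / x[i] for i in range(n) if x[i] > 0)


def max_column_sum(A):
    n = len(A)
    return max(sum(A[i][k] for i in range(n)) for k in range(n))


def improve(A, x, steps=60):
    """Replace x by Ax (scaled so that the components sum to 1) and record r_x each time."""
    n = len(A)
    history = [r_value(A, x)]
    for _ in range(steps):
        Ax = [sum(A[i][k] * x[k] for k in range(n)) for i in range(n)]
        total = sum(Ax)
        x = [v / total for v in Ax]
        history.append(r_value(A, x))
    return x, history
```

For the positive matrix `A = [[2.0, 1.0], [1.0, 3.0]]` and the starting vector `[1.0, 1.0]`, the recorded values begin at 3, never decrease, stay below the largest column sum 4, and approach the maximal eigenvalue $(5 + \sqrt{5})/2$.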

Since the set $r_x$ is bounded, it must have a least upper bound r. We
now prove that there exists a vector $x^*$ such that $r_{x^*} = r$.

Instead of considering the set of all vectors $x \ge 0$, we may restrict
ourselves to considering only vectors x with $\sum x_i = 1$, since

$$\frac{\sum_k a_{ik} x_k}{x_i} = \frac{\sum_k a_{ik} \left(x_k / \sum_j x_j\right)}{x_i / \sum_j x_j}.$$

These vectors form a closed and bounded set. Hence, there exists
among them a converging sequence of vectors $x^1, x^2, \ldots$ for which
$\lim r_{x^p} = r$. Let $x^p \to \xi$, where $\xi$ is again a vector in the same space.
We only have to show that $r_\xi = r$. By assumption,

$$r \ge r_\xi.$$

On the other hand,

$$Ax^p - r_{x^p} x^p \ge 0;$$

hence also

$$A\xi - r\xi \ge 0,$$

which, however, implies that

$$r_\xi \ge r,$$

so that we obtain

$$r_\xi = r.$$

(b) We now prove that $r$ is an eigenvalue of $A$ and that every vector
$x$ for which $r_x = r$ is an eigenvector. For this purpose the following
lemma is used:

Let $y$ be any (nonzero) nonnegative vector. Then

$$(I + A)^{n-1} y > 0.$$

This can be shown by proving that for all $t$ the vector $(I + A)^t y$
contains nonzero elements wherever $(I + A)^{t-1} y$ has nonzero elements
and, unless that vector is already positive, at least one more nonzero
element. This is a consequence of the fact that $A$ is indecomposable.
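
The growth of the set of nonzero components described in the lemma can be watched on a small case. The following is only an illustrative sketch: the matrix `A` (a cyclic permutation matrix, which is indecomposable) and the starting vector `y` are hypothetical choices, not taken from the text, and the helper names are ours.

```python
def mat_vec(M, v):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def support_growth(A, y):
    """Return the index sets of nonzero components of
    y, (I + A)y, ..., (I + A)^(n-1) y for nonnegative A and y."""
    n = len(A)
    i_plus_a = [[A[i][j] + (1 if i == j else 0) for j in range(n)]
                for i in range(n)]
    supports = [{i for i, v in enumerate(y) if v != 0}]
    for _ in range(n - 1):
        y = mat_vec(i_plus_a, y)
        supports.append({i for i, v in enumerate(y) if v != 0})
    return supports


# Hypothetical example: a 4 x 4 cyclic permutation matrix (indecomposable).
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]]
y = [1, 0, 0, 0]
supports = support_growth(A, y)
for t, s in enumerate(supports):
    print(t, sorted(s))
```

In this example each multiplication by $I + A$ adds exactly one new nonzero position, so the bound $n - 1$ on the exponent is attained.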
Consider then a vector $z$ such that

$$Az - rz = y \ge 0.$$

We know that $y > 0$ is impossible. However, we multiply the above
inequality by $(I + A)^{n-1}$:

$$A(I + A)^{n-1} z - r(I + A)^{n-1} z = (I + A)^{n-1} y \ge 0.$$

If $y \ne 0$, the lemma implies that

$$A(I + A)^{n-1} z - r(I + A)^{n-1} z > 0,$$

which is impossible. This implies that $y = 0$.

This again implies that $r$ is an eigenvalue of $A$ and that $z$ is an
eigenvector.

(c) The extremal vectors are positive. Since with $z$, also, the positive
vector $(I + A)^{n-1} z$ is an eigenvector and since $Az = rz$, the vector
$(I + A)^{n-1} z$ coincides with $(1 + r)^{n-1} z$, a positive multiple of $z$.
Hence $z$ is positive.

(d) $r$ is a maximal eigenvalue. Let $\alpha$ be any eigenvalue of $A$ and
let $x$ be its corresponding eigenvector; then

$$\alpha x = Ax$$

holds. This implies

$$|\alpha|\, x^* \le A^* x^* = A x^*,$$

where $x^*$, $A^*$ denote the vector and matrix obtained from $x$ and $A$ by
replacing each element by its absolute value. The inequality

$$A x^* - |\alpha|\, x^* \ge 0$$

implies, by definition,

$$|\alpha| \le r.$$


(e) No eigenvalue which differs from $r$ can have a positive eigenvector.
This follows from the following lemma (see, e.g., Zurmühl
[42], p. 161):

Let $A$ be any matrix with eigenvalue $\alpha$ and corresponding vector $x$. Let
$\beta \ne \alpha$ be also an eigenvalue of $A$, and let $y$ be an eigenvector of $\beta$ considered
as an eigenvalue of the transpose of $A$. Then $y'x = 0$.

We apply this lemma to $\alpha = r$, considered as an eigenvalue of $A'$,
when $\beta \ne r$ is any eigenvalue of $A$. Since the corresponding eigenvector
of $r$ with respect to $A'$ is also positive, it follows that the eigenvector of
$\beta$ cannot be nonnegative.

(f) $r$ is a simple eigenvalue of $A$. First we show that $r$ has only one
linearly independent eigenvector. Let $z$ be any vector for which
$r_z = r$, and let $x$ be another eigenvector of $r$, either extremal or not.
Take a number $c$ such that $x - cz = y \ge 0$, with at least one $y_i = 0$.
Since the vector $y$ is also an extremal vector of $r$, a contradiction is
found which implies $y = 0$ or $x = cz$.

Consider next $(-1)^n$ times the characteristic polynomial $\varphi(x)$ of $A$,
that is, $|xI - A|$. We know that $\varphi(r) = 0$ and want to establish
$\varphi'(r) \ne 0$. Since $\varphi'(r)$ can be written in the form $\sum X_{ii}$, where $X_{ik}$
are the cofactors of $rI - A$, we see that $\varphi'(r)$ is equal to the trace of
$(X_{ik})$. The matrix $(X_{ik})$ has rank 1 for $x = r$, hence is not the zero
matrix. We note that

$$(rI - A)(X_{ik}) = 0.$$

Hence, every column of $(X_{ik})$ is an eigenvector of $r$ or consists of
zeros. There is only one linearly independent eigenvector, whose
elements are all $\ne 0$ and have the same sign. Hence the elements in
each column have the same sign; similarly, the elements in each row
have the same sign, since they are the eigenvectors of $r$ with respect to
$A'$. Hence all elements of $(X_{ik})$ have the same sign, which implies
that its trace is $\ne 0$.


8.26 An Inclusion Theorem for the Eigenvalues of an 
Indecomposable Nonnegative Matrix 


Collatz [7] proved that the intervals spanned by the quotients
$\sum_k a_{ik} x_k / x_i$ considered in Sec. 8.25, formed for an arbitrary positive
vector, always include the dominant eigenvalue. This follows easily
from Wielandt’s treatment described in Sec. 8.25.
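
The inclusion can be tried numerically. The sketch below is an illustration under stated assumptions, not part of Collatz's argument: the $2 \times 2$ matrix (whose dominant eigenvalue is 3) and the starting vector are hypothetical, and the optional step of replacing $x$ by $Ax$ a few times before forming the quotients is a common device added here to show the interval narrowing.

```python
def collatz_quotients(A, x):
    """Quotients (sum_k a_ik x_k) / x_i for a vector x with positive entries."""
    ax = [sum(a * xk for a, xk in zip(row, x)) for row in A]
    return [axi / xi for axi, xi in zip(ax, x)]


def collatz_bounds(A, x, steps=0):
    """Replace x by Ax `steps` times (rescaled to sum 1), then return the
    smallest and largest quotient; they bracket the dominant eigenvalue."""
    for _ in range(steps):
        x = [sum(a * xk for a, xk in zip(row, x)) for row in A]
        total = sum(x)
        x = [xi / total for xi in x]  # rescaling leaves the quotients unchanged
    q = collatz_quotients(A, x)
    return min(q), max(q)


# Hypothetical positive matrix; its eigenvalues are 3 and 1.
A = [[2.0, 1.0],
     [1.0, 2.0]]
lo0, hi0 = collatz_bounds(A, [1.0, 3.0])
lo5, hi5 = collatz_bounds(A, [1.0, 3.0], steps=5)
print(lo0, hi0)
print(lo5, hi5)
```

For the starting vector $(1, 3)$ the quotients are $5$ and $7/3$, so the first interval already contains the dominant eigenvalue 3; after five multiplications by $A$ the interval has width below $0.01$.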


8.27 A Similar Inclusion Theorem for Real Symmetric 
Matrices and Other Inclusion Theorems 


Collatz [7] later proved by a different method that for real symmetric 
matrices the quotients mentioned in Sec. 8.26, formed for an arbitrary 
real vector without zero components, span an interval which includes 
at least one eigenvalue of the matrix. A generalization of this valid 
for arbitrary normal matrices was obtained simultaneously by Walker 
and Weston [37a] and by Wielandt [40]. The formulation of Wielandt’s 
theorem enables it to include various other previously found inclusion 
theorems. For other inclusion theorems concerning hermitian and 
normal matrices, see, in particular, Fan and Hoffman [8], Kato [18], and
Wielandt [41].


8.28 A Problem Concerning Positive Matrices and 
Symmetric Matrices 


In Secs. 8.26 and 8.27 a theorem was mentioned which holds both for
positive and for symmetric matrices but which necessitates, so far,
different proofs. There are other theorems of this nature, and it
seems desirable to find a unified treatment for both cases. Examples of
such theorems are:

1. The dominant root exceeds the diagonal elements; more generally,
every principal minor of $\det(\lambda I - A)$ is greater than zero if $\lambda$ is larger
than the dominant root (this is true for positive matrices and for
symmetric matrices).

2. The inequality 


is valid for matrices with nonnegative minors of all orders and for 
positive definite symmetric matrices. 

3. A matrix with all its minors of all orders nonnegative has all 
eigenvalues real and nonnegative; a positive semidefinite symmetric 
matrix has all eigenvalues real and nonnegative. 

4. The eigenvalues of the matrix are separated by the eigenvalues
of a principal minor of order $n - 1$ (this is true for matrices all of
whose minors are positive and for symmetric matrices).

Problem 2 was suggested to K. Fan, who subsequently found a unified 
treatment (unpublished). 


8.29 Stochastic Matrices 


From the last remark in Sec. 8.23 it follows that a positive (hence 
indecomposable) matrix cannot have its maximum eigenvalue equal to 
its maximum row sum unless all row sums are equal. This can also 
be deduced from the fact that for an indecomposable matrix an eigen- 
value cannot lie on the boundary of the Gersgorin circles unless it lies 
on the boundary of all the circles. Finally, it can quite easily be 
proved independently by the following argument. Let $r$ be the
maximum eigenvalue, let $x = (x_1, \ldots, x_n)$ be the corresponding
eigenvector, and assume

$$r = \max_i \sum_{k=1}^{n} a_{ik}.$$

Let $x_\mu = \max_i x_i$; then

$$r x_\mu = \sum_{k=1}^{n} a_{\mu k} x_k \le x_\mu \sum_{k=1}^{n} a_{\mu k} \le x_\mu \max_i \sum_{k=1}^{n} a_{ik}.$$

Since

$$r = \max_i \sum_{k=1}^{n} a_{ik},$$

we have $x_k = x_\mu$ for all $k$; hence all row sums are equal too.

A nonnegative matrix for which all row sums are equal is called 
stochastic. Matrices of this type play a great role in many branches of 
mathematics, as well as in applications—for example, in the study of 
the transition probability of Markoff chains. For references
concerning stochastic matrices, see, for example, Gantmakher [9].


8.30 Bounds for $\lambda_{\max}$ in a Positive Matrix


Since in a nonstochastic positive matrix, $\lambda_{\max}$ differs from the
maximum row sum, we may put


$$\lambda_{\max} = \text{max row sum} - \rho, \qquad \rho > 0.$$


A bound for p was given by Ledermann [19] and later improved by 
Ostrowski [26] and Brauer [43]. 


8.31 Completely Nonnegative (Positive) Matrices* 


Completely nonnegative matrices are those in which all the minors 
of all dimensions are nonnegative (positive). Such matrices have been 
studied primarily by Gantmakher and Krein [10]. The eigenvalues of 
a completely nonnegative matrix are nonnegative. Of particular 
importance are the oscillatory matrices. They are completely non- 
negative, but a power of the matrix is completely positive. All the 
eigenvalues of such a matrix are positive and simple. Let $\lambda_1 > \lambda_2 >
\lambda_3 > \cdots$ be these eigenvalues. The number of changes of sign in the
eigenvector which corresponds to $\lambda_i$ is exactly $i - 1$. A special case of
a completely positive matrix is the Hilbert matrix $\left(\frac{1}{i + k}\right)$ or, more
generally, the matrices $\left(\frac{1}{x_i + y_k}\right)$ with $0 < x_1 < x_2 < \cdots < x_n$ and
$0 < y_1 < y_2 < \cdots < y_n$.
HERMITIAN FORMS AND EIGENVALUES 
by Marvin Marcus 


8.32 Introduction 


We attempt to survey here some of the more recent techniques used 
in investigating quadratic forms, eigenvalues, and singular values of 
linear transformations on the n-dimensional unitary space $V_n$ to itself.
The discussion is separated into three sections. The first of these is
devoted to an exposition of those properties of the Grassmann product and
compound transformations which we need here and which are useful
in other problems (e.g., the totally positive matrices of Gantmakher
and Krein). We then discuss some of the elementary properties
of convex sets and functions and obtain the essential structure theorem
for doubly stochastic (d.s.) matrices. From these we can easily derive
inequalities connecting singular values, eigenvalues, and quadratic
forms. The final section is devoted to a presentation of the more
advanced results that have recently been completed.

Although some attempt has been made to make the material herein 
self-contained, we by no means prove every lemma. Rather we hope 
that the proofs that are presented will convey some idea of the techniques 
and devices that seem to work effectively in dealing with a rather wide 
class of problems. The knowledgeable reader will also recognize that
the definitions and results are not always presented in their most 
general form. However, we have attempted to minimize complexity 
in the statements of the results, sometimes at the expense of generality. 
Notes and bibliography are deferred to the end. 


8.33 Grassmann Products and Compounds 


A useful and natural tool for dealing with products of eigenvalues 
of a linear transformation $T$ is the compound of $T$. We list here the
pertinent definitions and theorems for the exterior product and the 
induced compound in the order in which they are most readily proved. 

We introduce some notation to diminish the number of subscripts 
usually attendant with these objects. 

1. $e_i$ is the unit vector with $i$th coordinate 1. The binomial
coefficient $\binom{n}{p}$ is $n!/p!(n - p)!$. By $\prod_1^p V_n$ we mean the cartesian product
of $V_n$ with itself $p$ times.

2. If $1 \le p \le n$, let $Q_{p,n} = \{(i_1, \ldots, i_p) : 1 \le i_1 < i_2 < \cdots < i_p \le n\}$.
That is, $Q_{p,n}$ is the totality of strictly increasing functions on the integers
$1, \ldots, p$ into the integers $1, \ldots, n$. If $\alpha = (i_1, \ldots, i_p)$ and $\omega =
(j_1, \ldots, j_p)$ are two elements of $Q_{p,n}$, then $\alpha$ comes before $\omega$ in
lexicographic order if there is an integer $m$, $1 \le m \le p$, for which $i_t = j_t$,
$t = 1, \ldots, m - 1$, and $i_m < j_m$.

Definition 8.1. Let $f$ be a function on $\prod_1^p V_n$ into a vector space $W$
such that $f$ is linear in each variable and

$$f(x_1, \ldots, x_p) = \operatorname{sign} \pi \, f(x_{\pi(1)}, \ldots, x_{\pi(p)}) \tag{8.1}$$

for any permutation $\pi$ of $1, \ldots, p$. Then $f$ is called a multilinear
function on $\prod_1^p V_n$ to $W$.

Theorem 8.1. For each $p = 1, 2, \ldots, n$ there exists a multilinear
function $f$ on $\prod_1^p V_n$ to $V_{\binom{n}{p}}$ such that the smallest vector space containing the
range of $f$ is $V_{\binom{n}{p}}$.

Proof. Let $x_i = (x_{i1}, \ldots, x_{in})$, $i = 1, \ldots, p$, be any $p$ vectors in $V_n$.
For $\omega \in Q_{p,n}$, choose $p$ columns from $X = (x_{ij})$ with indices $\omega$ and form
the $p$-square subdeterminant so obtained. Arrange these numbers in
lexicographic order according to the choice of $\omega$. The multilinearity
is immediate, and the last assertion is made clear by choosing $x_j = e_{i_j}$,
$j = 1, \ldots, p$.

We denote the function $f$ in the proof of Theorem 8.1 by

$$f(x_1, \ldots, x_p) = x_1 \wedge \cdots \wedge x_p,$$

the usual notation for the exterior, or Grassmann, product. Note
here that the mapping is in general not onto $V_{\binom{n}{p}}$. For example, if
$n = 4$ and $p = 2$ and $v = x_1 \wedge x_2 = (v_1, \ldots, v_6)$, then $v_1 v_6 = v_2 v_5 - v_3 v_4$.
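
The construction in the proof of Theorem 8.1 and the quadratic relation in the example can be checked directly. This is a minimal sketch in exact integer arithmetic; the two vectors are hypothetical, and `det` and `wedge` are helper names introduced only for this illustration.

```python
from itertools import combinations


def det(M):
    """Determinant by cofactor expansion along the first row (small M only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def wedge(vectors):
    """Coordinates of x_1 ^ ... ^ x_p: the p-square subdeterminants of the
    p x n array of the vectors, column sets taken in lexicographic order."""
    p, n = len(vectors), len(vectors[0])
    return [det([[row[j] for j in cols] for row in vectors])
            for cols in combinations(range(n), p)]


# Hypothetical vectors in V_4, chosen only for illustration.
x1 = [1, 2, 0, 3]
x2 = [0, 1, 4, -1]
v = wedge([x1, x2])
print(v)
print(v[0] * v[5], v[1] * v[4] - v[2] * v[3])
```

`itertools.combinations` yields the index sets of $Q_{p,n}$ in lexicographic order, which is why no explicit sorting is needed. The two printed products agree, so this $v$ satisfies the relation $v_1 v_6 = v_2 v_5 - v_3 v_4$; a vector of $V_6$ violating it cannot be of the form $x_1 \wedge x_2$.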


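Theorem 8.2. Let $y_1, \ldots, y_p$ belong to $V_n$ and suppose

$$y_i = \sum_{j=1}^{n} d_{ij} x_j, \qquad i = 1, \ldots, p. \tag{8.2}$$

Then

$$y_1 \wedge \cdots \wedge y_p = \sum_{\omega \in Q_{p,n}} c_\omega x_\omega,$$

where, if $\omega = (i_1, \ldots, i_p)$, then

$$x_\omega = x_{i_1} \wedge \cdots \wedge x_{i_p}$$

and

$$c_\omega = \det \begin{pmatrix} d_{1 i_1} & \cdots & d_{1 i_p} \\ \vdots & & \vdots \\ d_{p i_1} & \cdots & d_{p i_p} \end{pmatrix}.$$

Proof.

$$y_1 \wedge \cdots \wedge y_p = \Big(\sum_{s_1} d_{1 s_1} x_{s_1}\Big) \wedge \cdots \wedge \Big(\sum_{s_p} d_{p s_p} x_{s_p}\Big)$$

$$= \sum_{s_1} d_{1 s_1} \Big(x_{s_1} \wedge \sum_{s_2} d_{2 s_2} x_{s_2} \wedge \cdots \wedge \sum_{s_p} d_{p s_p} x_{s_p}\Big)$$

$$= \sum_{s_1, \ldots, s_p} d_{1 s_1} d_{2 s_2} \cdots d_{p s_p} \, x_{s_1} \wedge \cdots \wedge x_{s_p}.$$

Now it is clear that, if any two indices in $x_{s_1} \wedge \cdots \wedge x_{s_p}$ are the same,
the value is 0. Moreover, we may restrict our summation to those
sequences which are in $Q_{p,n}$ as follows:

$$y_1 \wedge \cdots \wedge y_p = \sum_{\omega \in Q_{p,n}} \Big[\sum_{\pi} \operatorname{sign} \pi \, d_{1 i_{\pi(1)}} \cdots d_{p i_{\pi(p)}}\Big] x_\omega$$

$$= \sum_{\omega \in Q_{p,n}} c_\omega x_\omega.$$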
From this result the following two theorems are immediate. 
Theorem 8.3. If $x_i$, $i = 1, \ldots, n$, constitute a basis for $V_n$, then the
$x_\omega$, $\omega \in Q_{p,n}$, constitute a basis for $V_{\binom{n}{p}}$.

Theorem 8.4. $x_1 \wedge \cdots \wedge x_p = 0$ if and only if $x_1, \ldots, x_p$ are linearly
dependent.

The next result describes the form of the inner product of two vectors
which are exterior products themselves.

Theorem 8.5.

$$(x_1 \wedge \cdots \wedge x_p, \; y_1 \wedge \cdots \wedge y_p) = \det \{(x_i, y_j)\}, \qquad i, j = 1, \ldots, p. \tag{8.3}$$
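Proof. Let

$$x_i = (x_{i1}, \ldots, x_{in}), \qquad y_j = (y_{j1}, \ldots, y_{jn}),$$

$i, j = 1, \ldots, p$. Then

$$\det \{(x_i, y_j)\} = \det \Big(\sum_{r=1}^{n} x_{ir} \bar{y}_{jr}\Big)$$

$$= \sum_{r_1, \ldots, r_p} \bar{y}_{1 r_1} \cdots \bar{y}_{p r_p} \det \begin{pmatrix} x_{1 r_1} & \cdots & x_{1 r_p} \\ \vdots & & \vdots \\ x_{p r_1} & \cdots & x_{p r_p} \end{pmatrix}$$

$$= \sum_{\omega \in Q_{p,n}} \Big[\sum_{\pi} \bar{y}_{1 r_{\pi(1)}} \cdots \bar{y}_{p r_{\pi(p)}} \operatorname{sign} \pi\Big] \det \{x_{i r_j}\}$$

$$= \sum_{\omega \in Q_{p,n}} \det \{x_{i r_j}\} \, \overline{\det \{y_{i r_j}\}},$$

where $\omega = (r_1, \ldots, r_p)$ in the last two sums.

This calculation completes the proof.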
As an immediate consequence we have the following result. 
Theorem 8.6. If $1 \le p \le n$ and $x_1, \ldots, x_n$ is an orthonormal (o.n.)
set in $V_n$, then $x_\omega$, $\omega \in Q_{p,n}$, is an o.n. set in $V_{\binom{n}{p}}$.

Definition 8.2. Let $A$ be a linear transformation on $V_n$ to $V_n$.
Let $C_p(A)$, the $p$th compound of $A$, on $V_{\binom{n}{p}}$ to $V_{\binom{n}{p}}$ be defined by

$$C_p(A) \, e_{i_1} \wedge \cdots \wedge e_{i_p} = A e_{i_1} \wedge \cdots \wedge A e_{i_p} \tag{8.4}$$

for each basis vector $e_\omega$, $\omega \in Q_{p,n}$.

The following result is an immediate consequence of the properties
of the exterior product and the relation (8.4).

Theorem 8.7. If $y_1 \wedge \cdots \wedge y_p \in V_{\binom{n}{p}}$, then

$$C_p(A) \, y_1 \wedge \cdots \wedge y_p = A y_1 \wedge \cdots \wedge A y_p.$$
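The matrix representation of $C_p(A)$ is described as follows: If
$x_1, \ldots, x_n$ is a basis for $V_n$, then the representation of $C_p(A)$ relative
to the basis $x_\omega$, $\omega \in Q_{p,n}$, is an $\binom{n}{p}$-square matrix whose entries are the
$p$-square subdeterminants of the representation of $A$ relative to $x_1, \ldots, x_n$
arranged in doubly lexicographic order according to row and column
selections from $A$. For example, if

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix},$$

then

$$C_2(A) = \begin{pmatrix}
\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} &
\begin{vmatrix} a_{11} & a_{13} \\ a_{21} & a_{23} \end{vmatrix} &
\begin{vmatrix} a_{12} & a_{13} \\ a_{22} & a_{23} \end{vmatrix} \\[2ex]
\begin{vmatrix} a_{11} & a_{12} \\ a_{31} & a_{32} \end{vmatrix} &
\begin{vmatrix} a_{11} & a_{13} \\ a_{31} & a_{33} \end{vmatrix} &
\begin{vmatrix} a_{12} & a_{13} \\ a_{32} & a_{33} \end{vmatrix} \\[2ex]
\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} &
\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} &
\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix}
\end{pmatrix}.$$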

The following properties of $C_p(A)$ are immediate consequences of the
definition.

Theorem 8.8. If $1 \le p \le n$, then

(i) $C_p(AB) = C_p(A) C_p(B)$;
(ii) $C_p^*(A) = C_p(A^*)$, $A^*$ the conjugate transpose of $A$;
(iii) $C_p(A^{-1}) = C_p^{-1}(A)$.
(iv) If $A$ is normal, hermitian, positive definite (or nonnegative), unitary, so is
$C_p(A)$.
(v) The eigenvalues of $C_p(A)$ are the $\binom{n}{p}$ numbers $\lambda_{i_1} \lambda_{i_2} \cdots \lambda_{i_p}$ for
$(i_1, \ldots, i_p) \in Q_{p,n}$, where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $A$.
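Proof.

(i) If $x_\omega$ is a basis for $V_{\binom{n}{p}}$, then

$$C_p(AB) x_\omega = (AB) x_{i_1} \wedge \cdots \wedge (AB) x_{i_p}
= C_p(A) \, B x_{i_1} \wedge \cdots \wedge B x_{i_p}
= C_p(A) C_p(B) x_\omega.$$

(ii) $[C_p(A) \, x_1 \wedge \cdots \wedge x_p, \; y_1 \wedge \cdots \wedge y_p]
= \det \{(A x_i, y_j)\} = \det \{(x_i, A^* y_j)\}
= [x_1 \wedge \cdots \wedge x_p, \; C_p(A^*) \, y_1 \wedge \cdots \wedge y_p]$.

(iii) $C_p(A A^{-1}) = C_p(I_n) = I_{\binom{n}{p}}$.

(iv) For example, if $A$ is normal,

$$C_p(A) C_p^*(A) = C_p(A) C_p(A^*) = C_p(A A^*)
= C_p(A^* A) = C_p(A^*) C_p(A)
= C_p^*(A) C_p(A).$$

(v) Assume that $A$ is triangular. Then it is a direct calculation to see
that $C_p(A)$ is triangular with diagonal elements [eigenvalues of
$C_p(A)$] precisely the numbers $\lambda_{i_1} \cdots \lambda_{i_p}$. But any matrix can be
unitarily triangulated, and $C_p(A)$ is similar to $C_p(U^* A U) =
C_p(U^*) C_p(A) C_p(U)$.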

This completes the proof. 
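
Parts (i) and (v) of Theorem 8.8 lend themselves to a direct check on small integer matrices. The following is only a sketch under stated assumptions: the matrices `A`, `B`, and `T` are hypothetical examples, and `compound` is a helper name introduced here that builds $C_p$ from the $p$-square subdeterminants in doubly lexicographic order, as described before the theorem.

```python
from itertools import combinations


def det(M):
    """Determinant by cofactor expansion along the first row (small M only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def compound(A, p):
    """p-th compound: all p-square subdeterminants of A, with row and column
    index sets both taken in lexicographic order."""
    index_sets = list(combinations(range(len(A)), p))
    return [[det([[A[i][j] for j in cols] for i in rows]) for cols in index_sets]
            for rows in index_sets]


def mat_mul(A, B):
    """Ordinary matrix product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Hypothetical integer matrices, so that all arithmetic is exact.
A = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]
B = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]
lhs = compound(mat_mul(A, B), 2)
rhs = mat_mul(compound(A, 2), compound(B, 2))
print(lhs == rhs)

# Triangular case of part (v): eigenvalues 2, 3, 4 give products 6, 8, 12.
T = [[2, 5, 7], [0, 3, 1], [0, 0, 4]]
C2T = compound(T, 2)
print([C2T[i][i] for i in range(3)])
```

The equality `lhs == rhs` is the Cauchy-Binet identity in disguise, and the diagonal of `C2T` lists the pairwise products of the eigenvalues of `T`.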

Definition 8.3. If $A$ is a linear transformation on $V_n$ to $V_n$, then the
singular values of $A$ are the nonnegative square roots of the eigenvalues of
$A^*A$.

Let

$$|\lambda_1| \ge \cdots \ge |\lambda_n|, \qquad \alpha_1 \ge \cdots \ge \alpha_n, \tag{8.5}$$

where $\lambda_i$ and $\alpha_i$, $i = 1, \ldots, n$, are, respectively, the eigenvalues and
singular values of $A$. It is clear that

$$|\lambda_1| = |(Ax, x)| \le \|Ax\| = (Ax, Ax)^{1/2} = (A^*Ax, x)^{1/2} \le \alpha_1 \tag{8.6}$$

for any $A$ and $x$, the normalized eigenvector of $A$ corresponding to $\lambda_1$.
Now the singular values of $C_p(A)$ are the nonnegative square roots of the
eigenvalues of $C_p(A^*A)$ and hence are the numbers $\alpha_{i_1} \cdots \alpha_{i_p}$.
Hence, by (8.5) and (8.6) applied to $C_p(A)$, we have the following
result.
Theorem 8.9. Let $A$ be an arbitrary linear transformation on $V_n$ to $V_n$
with eigenvalues and singular values (8.5). Then, for $1 \le p \le n$,

$$\prod_{j=1}^{p} |\lambda_j| \le \prod_{j=1}^{p} \alpha_j, \tag{8.7}$$

with equality for $p = n$.
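
For a $2 \times 2$ matrix both sides of (8.7) can be evaluated from the quadratic formula, which gives a quick sanity check. This is only a sketch: the matrix is a hypothetical real example, and the helper names are ours. For a real matrix, $A^*A$ is the product of the transpose of $A$ with $A$.

```python
import cmath
import math


def eigenvalues_2x2(M):
    """Eigenvalues of a 2 x 2 matrix, ordered by decreasing absolute value."""
    (a, b), (c, d) = M
    trace, determinant = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * determinant)
    return sorted([(trace + root) / 2, (trace - root) / 2], key=abs, reverse=True)


def singular_values_2x2(M):
    """Nonnegative square roots of the eigenvalues of A*A (real A assumed)."""
    (a, b), (c, d) = M
    gram = [[a * a + c * c, a * b + c * d],
            [a * b + c * d, b * b + d * d]]
    return [math.sqrt(max(ev.real, 0.0)) for ev in eigenvalues_2x2(gram)]


# Hypothetical example: eigenvalues 2 and 1, but a large off-diagonal entry.
A = [[1.0, 4.0],
     [0.0, 2.0]]
lam = eigenvalues_2x2(A)
alpha = singular_values_2x2(A)
print(abs(lam[0]), alpha[0])
print(abs(lam[0] * lam[1]), alpha[0] * alpha[1])
```

Here the inequality for $p = 1$ is strict, while the two products for $p = 2$ both equal $|\det A| = 2$, in line with the equality case $p = n$.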

In the next section we introduce some elementary properties of 
convex sets and functions and use them to obtain some of the easier 
consequences of (8.7). It turns out that our methods also give us infor- 
mation on the extreme values of functions of quadratic forms associated 
with an hermitian matrix. 


8.34 Convex Functions and D.S. Matrices 


Definition 8.4. A set $M$ in real euclidean $n$ space $E_n$ is convex if the
line segment joining any two points of $M$ consists entirely of points in $M$:
$x \in M$ and $y \in M$ and $0 \le \theta \le 1$ imply

$$\theta x + (1 - \theta) y \in M.$$

Definition 8.5. A real-valued function $f$ on the convex set $M$ is
convex if

$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$$

for $x \in M$, $y \in M$, $0 \le \theta \le 1$.

Definition 8.6. An $n$-square real matrix $S$ is doubly stochastic (d.s.)
if all elements are nonnegative and any row and column sum is 1. The
totality of d.s. matrices of size $n \times n$ is denoted by $M_n$.

Definition 8.7. If $a_j$, $j = 1, \ldots, m$, is a set of points in $E_n$, then
$H(a_1, \ldots, a_m)$, the convex hull of the $a_j$, is the set of points defined by

$$x = \sum_{j=1}^{m} t_j a_j, \qquad t_j \ge 0, \; j = 1, \ldots, m, \qquad \sum_{j=1}^{m} t_j = 1.$$

We remark that, if $f$ is a convex function on $H(a_1, \ldots, a_m)$, then
$x \in H(a_1, \ldots, a_m)$ implies that

$$f(x) = f\Big(\sum_{j=1}^{m} t_j a_j\Big) \le \sum_{j=1}^{m} t_j f(a_j) \le \max_j f(a_j).$$

Hence the maximum value of $f$ is achieved at a vertex.


Our first result concerning $M_n$ is the following. The proof is
somewhat long but is constructive and entirely elementary.

Theorem 8.10. Let $P_1, \ldots, P_m$, $m = n!$, be the permutation matrices
of size $n$. Then

$$M_n = H(P_1, \ldots, P_m). \tag{8.8}$$

Proof. It is immediate that

$$H(P_1, \ldots, P_m) \subset M_n.$$

Now let a “general diagonal” of any $n$-square matrix $A$ be a set of $n$
positions in $A$, precisely one in each row and one in each
column of $A$. The first information we need is contained in the
following lemma.

Lemma. Let $S$ be a set of elements of $A$. Then every general diagonal of
$A$ intersects $S$ if and only if $S$ contains an $s \times t$ submatrix with $s + t = n + 1$.

Necessity: It is clear that, by permuting rows and columns, a general
diagonal goes into a general diagonal. Hence we lose no generality
by assuming that $A$ has the form

$$A = \begin{pmatrix} T_1 & T_2 \\ S & T_3 \end{pmatrix}$$

with dimensions as follows:

$S$: $s \times t$, $s + t = n + 1$
$T_1$: $(n - s) \times t$
$T_2$: $(n - s) \times (n - t)$
$T_3$: $s \times (n - t)$

We may assume $t \ge s$ without loss of generality. If $d$ is a general
diagonal not intersecting $S$, it must go through precisely $t$ columns of $T_1$
and thus through precisely $t$ rows of $T_1$. Thus, since $t \ge s$, $d$ does not
intersect $T_2$, and hence intersects $T_3$ in precisely $(n - t)$ rows. But

$$n - t = s - 1,$$

and hence $d$ lies in at most $(s - 1)$ rows of $T_3$. But $T_3$ has $s$ rows and
consequently $d$ does not lie in every row of $A$, a contradiction. Thus
every diagonal hits $S$.

Sufficiency: The proof is by induction. For $n = 1$ or $S = A$ the
result is clear. Otherwise we assume there exists an element $a_{ij}$ not in
$S$ and let $B$ be the minor of $a_{ij}$. If $d$ is a diagonal through $a_{ij}$, then,
since $d \cap S \ne \emptyset$, it must intersect $S$ in elements of $B$. On the other
hand, any diagonal of $B$ can be extended to a diagonal of $A$ by adjoining
$a_{ij}$ and hence any diagonal of $B$ hits $S$ (since $a_{ij} \notin S$). By induction, $B$
contains a $p \times q$ submatrix $S_1$ of elements of $S$ with $p + q = n$. By
permutation of rows and columns of $A$, we may assume $A$ has the form

$$A = \begin{pmatrix} T_1 & T_2 \\ S_1 & T_3 \end{pmatrix}$$

where the dimensions are as follows:

$S_1$: $p \times q$, $p + q = n$
$T_1$: $(n - p) \times q = q \times q$
$T_2$: $(n - p) \times (n - q) = q \times p$
$T_3$: $p \times (n - q) = p \times p$.

Suppose there is a diagonal $d_1$ of $T_1$ such that $d_1 \cap S = \emptyset$. Then we
adjoin any diagonal $d_3$ of $T_3$ to $d_1$ to get a diagonal $d$ of $A$. But $d \cap S \ne
\emptyset$, so $d_3 \cap S \ne \emptyset$. Hence, if some diagonal of $T_1$ does not hit $S$, then
every diagonal of $T_3$ does (and conversely). So we assume that every
diagonal of $T_1$ hits $S$. Then $T_1$ has a $u \times v$ submatrix $S_2$ consisting of
elements of $S$ with $u + v = q + 1$. It is clear that we may combine $S_1$
and $S_2$ to obtain a submatrix $S_3$ of $A$ with dimensions $(u + p) \times v$
consisting of elements of $S$. Now

$$u + p + v = u + v + p = q + 1 + p = n + 1.$$

The lemma is thus established.

Proceeding to the proof of the theorem, we let $A \in M_n$ and let $S$ be the
set of zero entries of $A$. If every diagonal of $A$ hit $S$, then there would
exist an $s \times t$ submatrix $S_1$ of $A$ consisting of zeros with $s + t = n + 1$.
The complementary matrix of $S_1$, call it $T_2$, has dimensions $(n - s) \times
(n - t)$. We may assume that $A$ has the form

$$A = \begin{pmatrix} S_1 & T_1 \\ T_3 & T_2 \end{pmatrix}.$$

Now the sum of the elements in $T_1$ is $s$ and in $T_3$ is $t$. Hence the sum
over $T_2$ is $n - (s + t) = -1$, an impossibility. Now let $d_1$ be a
diagonal of $A$ with no zeros in it, and let $t_1$ be the least element in $d_1$.
Then, if $t_1 = 1$, it is clear that $A$ is a permutation matrix; otherwise we
choose $P_1$, a permutation matrix with 1s in positions corresponding to $d_1$.
Then

$$A_1 = \frac{A - t_1 P_1}{1 - t_1}$$

is d.s. and has at least one less positive element than $A$. We then
proceed as above, using $A_1$, and after at most $(n^2 - n)$
steps, we get a matrix with exactly $n$ positive elements in it (fewer than
$n$ would contradict the d.s. property) which must all be 1s. Thus

$$A = \sum_{j=1}^{m} t_j P_j, \qquad t_j \ge 0.$$

Now $\sum_{j=1}^{m} t_j = 1$ is clear from the fact that $A$ is d.s. This completes the
proof of Theorem 8.10.

We note here that the proof shows that any particular $S \in M_n$ is in the
convex hull of no more than $(n^2 - n + 1)$ permutation matrices.
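
Because the proof is constructive, it translates into a short procedure. The sketch below is a variant under stated assumptions rather than a transcription of the proof: instead of rescaling to $A_1 = (A - t_1 P_1)/(1 - t_1)$ at every step it keeps the unnormalized remainder, it finds a zero-free general diagonal by a standard augmenting-path matching (a technique not mentioned in the text), and the matrix `S` is a hypothetical example. Exact rational arithmetic is used so that entries become exactly zero.

```python
from fractions import Fraction


def positive_diagonal(A):
    """Find sigma with A[i][sigma[i]] > 0 for every row i (a general diagonal
    with no zeros), by augmenting-path matching; return None if none exists."""
    n = len(A)
    match_col = [None] * n  # match_col[j] = row currently assigned to column j

    def try_row(i, seen):
        for j in range(n):
            if A[i][j] > 0 and j not in seen:
                seen.add(j)
                if match_col[j] is None or try_row(match_col[j], seen):
                    match_col[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, set()):
            return None
    sigma = [None] * n
    for j, i in enumerate(match_col):
        sigma[i] = j
    return sigma


def birkhoff(A):
    """Write a doubly stochastic A as a list of (weight, permutation) pairs."""
    A = [[Fraction(v) for v in row] for row in A]
    terms = []
    while any(v > 0 for row in A for v in row):
        sigma = positive_diagonal(A)
        if sigma is None:
            raise ValueError("input is not doubly stochastic")
        t = min(A[i][sigma[i]] for i in range(len(A)))
        for i in range(len(A)):
            A[i][sigma[i]] -= t  # removes at least one positive entry
        terms.append((t, sigma))
    return terms


# Hypothetical doubly stochastic matrix.
S = [[Fraction(1, 2), Fraction(1, 2), 0],
     [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]]
terms = birkhoff(S)
for weight, sigma in terms:
    print(weight, sigma)
```

Each pair `(weight, sigma)` stands for the term $t_j P_j$, where $P_j$ has its 1 in row $i$ at column `sigma[i]`. The remainder after each subtraction still has equal row and column sums, so the lemma continues to guarantee a zero-free diagonal until the remainder vanishes.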
Definition 8.8. If $x_1 \ge x_2 \ge \cdots \ge x_n$ is an ordered set of real
numbers, we define $K_n(x)$ to be the intersection of half spaces and a
hyperplane:

$$t_{i_1} + \cdots + t_{i_k} \le x_1 + \cdots + x_k, \tag{8.9}$$

$$t_1 + \cdots + t_n = x_1 + \cdots + x_n, \tag{8.10}$$

$$1 \le k < n, \qquad 1 \le i_1 < \cdots < i_k \le n.$$

If $t \in K_n(x)$, we indicate this with the notation $[t] \prec [x]$.

Theorem 8.11.

$$K_n(x) = \{y \mid y = Sx, \; S \in M_n\}. \tag{8.11}$$

In other words, $K_n(x)$ is the convex hull of the $n!$ points $Px$ where $P$
runs over all n-square permutation matrices.
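
Membership in $K_n(x)$ is easy to test numerically, since the largest value of $t_{i_1} + \cdots + t_{i_k}$ over all index sets is the sum of the $k$ largest coordinates of $t$; it therefore suffices to compare partial sums of the decreasingly sorted vectors. A minimal sketch follows; the function name, the tolerance, and the sample data are assumptions made for this illustration.

```python
def majorized_by(t, x, tol=1e-12):
    """Check [t] < [x] in the sense of Definition 8.8: conditions (8.9) for
    k = 1, ..., n-1 and the equality (8.10)."""
    ts = sorted(t, reverse=True)
    xs = sorted(x, reverse=True)
    if len(ts) != len(xs) or abs(sum(ts) - sum(xs)) > tol:
        return False
    partial_t = partial_x = 0.0
    for tk, xk in zip(ts[:-1], xs[:-1]):
        partial_t += tk
        partial_x += xk
        if partial_t > partial_x + tol:
            return False
    return True


# Hypothetical data: y = Sx for a doubly stochastic S, as in Theorem 8.11.
x = [3.0, 2.0, 1.0]
S = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
y = [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]
print(y, majorized_by(y, x))
print(majorized_by([3.5, 1.5, 1.0], x))
```

The first test vector is of the form $Sx$ and lies in $K_3(x)$; the second has the right coordinate sum but its largest coordinate exceeds $x_1$, so it violates (8.9) for $k = 1$.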

Proof. The argument can be done directly in terms of the support
function $F$ of the convex hull $L$ of the points $\{Px\}$, where $P$ ranges over
all n-square permutation matrices. This is defined as follows:

$$F(u) = \max_{z \in L} (u, z),$$

and it is clear that

$$(t, u) = F(u)$$

is by definition a support plane of $L$ for any $u$. It is also true that $L$ is the
intersection of the half spaces $(t, u) \le F(u)$ as $u$ ranges over all n vectors.
In our case we easily check that

$$F(u) = \max_P (u, Px).$$

Now suppose $t \in L$; then

$$t = \sum_P \omega_P Px, \qquad \omega_P \ge 0, \qquad \sum_P \omega_P = 1,$$

so

$$\sum_i t_i = \sum_P \omega_P \sum_i (Px)_i = \sum_i x_i.$$

Moreover, let $u$ be a vector with coordinates $i_1, \ldots, i_k$ equal to 1 and
the rest 0. Then we check that

$$\sum_{j=1}^{k} t_{i_j} = (u, t) \le \max_P (u, Px) = \sum_{j=1}^{k} x_j.$$

Thus $L \subset K_n(x)$. On the other hand, let

$$f(t) = (u, t)$$

for a fixed $u$ and $t \in K_n(x)$. Since $f$ is linear in $t$, it must assume its
maximum value on one of the support planes (8.9). Thus assume
$\max f = f(t)$ and

$$\sum_{j=1}^{k} t_{i_j} = \sum_{j=1}^{k} x_j, \qquad 1 \le i_1 < \cdots < i_k \le n.$$

Let

$$t^1 = (t_{i_1}, \ldots, t_{i_k}), \qquad t^2 = (t_{i_{k+1}}, \ldots, t_{i_n}),$$

$$x^1 = (x_1, \ldots, x_k), \qquad x^2 = (x_{k+1}, \ldots, x_n).$$

Then we check easily that

$$[t^1] \prec [x^1], \qquad [t^2] \prec [x^2].$$

Since the theorem is true for $n = 1$, we obtain by induction that

$$t^1 = S_1 x^1, \qquad S_1 \in M_k,$$

and

$$t^2 = S_2 x^2, \qquad S_2 \in M_{n-k}.$$

But then $t = Q(t^1 \dotplus t^2) = Q(S_1 \dotplus S_2)x$, where $\dotplus$ indicates direct sum
and $Q$ is an appropriate permutation matrix. Thus

$$K_n(x) \subset L,$$

and the proof is complete.
Theorem 8.12. If $f(t_1, \ldots, t_k)$ is a function such that $f(e^{t_1}, \ldots, e^{t_k})$
is convex, nondecreasing in each $t_i$, and symmetric and if $A$ is an n-square complex
matrix with eigenvalues and singular values given by (8.5), then

$$f(|\lambda_1|, \ldots, |\lambda_k|) \le f(\alpha_1, \ldots, \alpha_k), \tag{8.12}$$

where $1 \le k \le n$.

Proof. From (8.7) we have

$$\sum_{j=1}^{p} \log |\lambda_j| \le \sum_{j=1}^{p} \log \alpha_j, \qquad 1 \le p \le k. \tag{8.13}$$

Let $x_j = \log \alpha_j$, $j = 1, \ldots, k - 1$,

$$x_k = \log \alpha_k - \Big(\sum_{j=1}^{k} \log \alpha_j - \sum_{j=1}^{k} \log |\lambda_j|\Big).$$

The first $(k - 1)$ inequalities (8.13) are unchanged when $x_k$ replaces
$\log \alpha_k$, and the $k$th inequality is changed to equality. If we set $r =
(\log |\lambda_1|, \ldots, \log |\lambda_k|)$ and observe that $x_1 \ge \cdots \ge x_k$, we conclude
from (8.11) that

$$r = Sx,$$

where $S \in M_k$. Let $g(t) = f(e^{t_1}, \ldots, e^{t_k})$, and we observe that

$$f(|\lambda_1|, \ldots, |\lambda_k|) = g(r) = g(Sx) \le g(Px) = g(x)
\le g(\log \alpha_1, \ldots, \log \alpha_k)
= f(\alpha_1, \ldots, \alpha_k). \tag{8.14}$$

The first inequality in (8.14) follows from the fact that $g(Sx)$ is a convex
function of $S$ and hence assumes its largest value on a permutation
matrix. The remaining steps are immediate consequences of the
nondecreasing and symmetry properties. We remark that, if $f(t_1, \ldots, t_k) =
t_1^\sigma + \cdots + t_k^\sigma$, $\sigma > 0$, then, from (8.12), we have

$$|\lambda_1|^\sigma + \cdots + |\lambda_k|^\sigma \le \alpha_1^\sigma + \cdots + \alpha_k^\sigma, \tag{8.15}$$

$1 \le k \le n$.

It is also clear that analogous results for f concave can be obtained in 
a similar way. 

Fortunately the methods used to get results like (8.12) give us
information on functions of quadratic forms with practically no labor. Consider,
for example, the following theorem.

Theorem 8.13. Let $A$ be an n-square complex hermitian matrix with
eigenvalues $\lambda_1 \ge \cdots \ge \lambda_n$. Let $x_1, \ldots, x_k$ be an o.n. set of vectors in $V_n$,
$1 \le k \le n$. Then

$$\sum_{j=1}^{k} \lambda_{n-j+1} \le \sum_{j=1}^{k} (Ax_j, x_j) \le \sum_{j=1}^{k} \lambda_j. \tag{8.16}$$

Proof. Let $u_1, \ldots, u_n$ be an o.n. set of eigenvectors of $A$ with
$(Au_i, u_i) = \lambda_i$, $i = 1, \ldots, n$. Then

$$(Ax_i, x_i) = \sum_{j=1}^{n} \lambda_j |(x_i, u_j)|^2.$$

Now, complete $x_1, \ldots, x_k$ to an o.n. basis for $V_n$ by adjoining $x_{k+1}, \ldots,
x_n$. It follows immediately from the o.n. properties that

$$S = (|(x_i, u_j)|^2) \in M_n.$$

Also,

$$(Ax_i, x_i) = (S_i, \lambda),$$

where $S_i$ is the $i$th-row vector of $S$ and $\lambda = (\lambda_1, \ldots, \lambda_n)$. So define

$$f(S) = \sum_{j=1}^{k} (S_j, \lambda),$$

and since $f$ is linear in $S$, the proof is completed by noting Theorem
8.10.

In case $A$ is positive definite, we can use the concavity of $\big(\prod_{j=1}^{k} t_j\big)^{1/k}$
in exactly the same way to get

$$\prod_{j=1}^{k} (Ax_j, x_j) \ge \prod_{j=1}^{k} \lambda_{n-j+1}. \tag{8.17}$$


A nice application of (8.17) is the following proof of the Minkowski
inequality:

$$|A + B|^{1/n} = \prod_{j=1}^{n} \big((A + B)x_j, x_j\big)^{1/n}$$

$$\ge \prod_{j=1}^{n} (Ax_j, x_j)^{1/n} + \prod_{j=1}^{n} (Bx_j, x_j)^{1/n}$$

$$\ge |A|^{1/n} + |B|^{1/n}. \tag{8.18}$$

Here $A$ and $B$ are positive definite hermitian matrices, and $x_1, \ldots, x_n$
are an o.n. set of eigenvectors of $A + B$. By choosing $x_j = e_j$ and
$k = n$, the relation (8.17) becomes the statement of the Hadamard
determinant theorem.
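
Both the Hadamard determinant theorem and the Minkowski inequality (8.18) can be spot-checked on small matrices. The following sketch assumes real symmetric positive definite inputs; the two matrices and the helper names are hypothetical choices for illustration.

```python
import math


def det(M):
    """Determinant by cofactor expansion along the first row (small M only)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def hadamard_holds(A):
    """True when det A does not exceed the product of the diagonal entries."""
    return det(A) <= math.prod(A[i][i] for i in range(len(A)))


def minkowski_gap(A, B):
    """|A + B|^(1/n) - |A|^(1/n) - |B|^(1/n); nonnegative by (8.18)."""
    n = len(A)
    total = [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]
    return det(total) ** (1.0 / n) - det(A) ** (1.0 / n) - det(B) ** (1.0 / n)


# Hypothetical positive definite matrices.
A = [[2.0, 1.0],
     [1.0, 2.0]]
B = [[1.0, 0.0],
     [0.0, 3.0]]
gap = minkowski_gap(A, B)
print(hadamard_holds(A), hadamard_holds(B), gap)
```

For these matrices $|A| = |B| = 3$ and $|A + B| = 14$, so the gap is $\sqrt{14} - 2\sqrt{3} > 0$; for the diagonal matrix `B` the Hadamard bound holds with equality.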

It is known that $E_r^{1/r}$ and $E_r/E_{r-1}$ are concave, nondecreasing
functions where $E_r$ is the $r$th elementary symmetric function. It is clear that
both (8.17) and (8.18) can be extended with little effort to similar
results for these functions.

Actually, (8.17) can be considerably sharpened by using the fact that
$C_k(A)$ is positive definite hermitian when $A$ is; for

$$\det \{(Ax_i, x_j)\} = \big(C_k(A) \, x_1 \wedge \cdots \wedge x_k, \; x_1 \wedge \cdots \wedge x_k\big) \ge \prod_{j=1}^{k} \lambda_{n-j+1} \tag{8.19}$$

and an application of the Hadamard determinant theorem to $\{(Ax_i, x_j)\}$
gives (8.17).

We just mention here that a function satisfying $f(Sx) \le f(x)$ for
$S \in M_n$, with equality if and only if $S$ is a permutation matrix, is called
Schur-convex. It is clear that this property is made to order for proofs
like those of Theorems 8.12 and 8.13.

Recently (8.19) was extended in a way that generalizes (8.16) for 
positive definite hermitian matrices. We do not include the proof. 

Theorem 8.14. Let $A$ be positive definite and let $x_1, \ldots, x_k$ be an o.n. set
of vectors in $V_n$. Then, for $1 \le r \le k \le n$,

$$E_r(\lambda_{n-k+1}, \ldots, \lambda_n) \le \sum_{\omega \in Q_{r,k}} [C_r(A) x_\omega, x_\omega] \le E_r(\lambda_1, \ldots, \lambda_k). \tag{8.20}$$

We note that (8.20) is not an immediate consequence of (8.16). For
although the sum in (8.20) is over a set of o.n. vectors in $V_{\binom{n}{r}}$, the
bounds are not necessarily sums of the largest or smallest $\binom{k}{r}$
eigenvalues of $C_r(A)$. The reason for this is that we are not taking extreme
values over all o.n. sets. A result related to (8.20) is the following:

If $A$ and $B$ are n-square complex matrices with singular values $\alpha_1 \ge \cdots \ge \alpha_n$
and $\beta_1 \ge \cdots \ge \beta_n$, respectively, then

$$|\operatorname{tr}(UAVB)| \le \sum_{j=1}^{n} \alpha_j \beta_j \tag{8.21}$$

where $U$ and $V$ are any two unitary matrices.

Moreover, the upper bound in (8.21) is assumed for appropriate
choices of $U$ and $V$. This result has recently been generalized in several
ways and references will be found in Sec. 8.36 and in the bibliography
at the end of the chapter.


8.35 Intermediate Eigenvalues 


A problem somewhat different from those considered in Sec. 8.34 is
the following. Given a set of some of the eigenvalues of $A$, say $\lambda_{i_1},
\lambda_{i_2}, \ldots, \lambda_{i_k}$, and a function $f(t_1, \ldots, t_k)$, find an extremal
characterization of $f(\lambda_{i_1}, \ldots, \lambda_{i_k})$ in terms of $f[(Ax_1, x_1), \ldots, (Ax_k, x_k)]$ for $x_1, \ldots,
x_k$ an o.n. set in $V_n$. The well-known Courant-Fischer result is of this
type, and the following theorem is a generalization of this as well as of
Theorem 8.13.

Theorem 8.15. Let $A$ be an hermitian matrix with eigenvalues
$\lambda_1 \ge \cdots \ge \lambda_n$ and let $1 \le i_1 < \cdots < i_k \le n$. Let $c_1 \ge \cdots \ge c_k$ be $k$
nonnegative real numbers. Then

$$\sum_{j=1}^{k} c_j \lambda_{i_j} = \max_R \min_x \sum_{j=1}^{k} c_j (Ax_j, x_j). \tag{8.22}$$

The notation here is the following. For fixed subspaces $R_1 \subset \cdots \subset
R_k$, with $\dim R_j = i_j$, $\min_x$ indicates the minimum over all possible o.n.
sets $x_1, \ldots, x_k$ with $x_j \in R_j$. Then $\max_R \min_x$ is the largest of these
minima as the spaces $R_1 \subset \cdots \subset R_k$ vary.

A result which follows from (8.22) is the following (see also Wielandt
[23]):

If $\lambda_1 \ge \cdots \ge \lambda_n$, $\mu_1 \ge \cdots \ge \mu_n$, and $\gamma_1 \ge \cdots \ge \gamma_n$ are, respectively,
the eigenvalues of $A$, $B$, and $A + B$ ($A$ and $B$ hermitian), then

$$\gamma = \lambda + S\mu \tag{8.23}$$

for some $S \in M_n$.
A further generalization of the type of Theorem 8.15 for symmetric 
functions is as follows. 
Theorem 8.16. Let $\varphi(t_1, \ldots, t_k)$ be symmetric and nondecreasing for
$t_j \in [\lambda_n, \lambda_1]$. Then if $1 \le i_1 < \cdots < i_k \le n$,

$$\varphi(\lambda_{i_1}, \ldots, \lambda_{i_k}) = \sup_R \inf_x \varphi(x_1, \ldots, x_k). \tag{8.24}$$

Here the notation is the same as above and $\varphi(x_1, \ldots, x_k)$ is by
definition $\varphi(\tilde{\lambda}_1, \ldots, \tilde{\lambda}_k)$ for $\tilde{\lambda}_j$, the eigenvalues of $A$ restricted to the space
spanned by $x_1, \ldots, x_k$. (By this we mean $\tilde{\lambda}_j$ are the eigenvalues of
$PAP$ where $P$ is the orthogonal projection into the space spanned by
$x_1, \ldots, x_k$.)

From (8.24) various inequalities connecting singular values are 
obtained. 

In the case in which $A$ is simply hermitian, the extreme values of

$$\prod_{j=1}^{k} (Ax_j, x_j) \tag{8.25}$$

have been investigated. Moreover, a theorem analogous to (8.20) for
$A$ indefinite is known for arbitrary $r$. Of interest would be a
generalization of these kinds of results to a more general class of transformations.
Results describing the structure of boundary points of $M_n$ would also be
useful in attacking these problems for a more general class of functions
than the convex functions.


8.36 Notes 


Sec. 8.33. The material on compounds can be found in [4], [15], 
and [21]. Theorem 8.9 was proved first in [22] in the way we have 
used here. Later a different method yielded the same result (see [6]). 

Sec. 8.34. The proof of Theorem 8.10 used here is found in [5]. 
Another shorter (but not constructive) proof is found in [10]. The 
result was originally published in [3]. Theorem 8.11 is proved in [18] 
and [2], as well as in the book “‘Inequalities,’ by G. H. Hardy, J. E. 
Littlewood, and G. Polya. Theorem 8.12 in the case in which f/f is 
Schur-convex may be found in [18] (the proof is the same). The 
argument, however, was originally used in [19] andalsoin[7]. Remark 
(8.15) is found in [22], [7], and [6]. Theorem 8.13 was proved first in 
[6], with a somewhat different argument. The proof given here has 
been used in various forms in [2] and [13]. Further extensions are to 
be found in [17]. The remark (8.17) is found in [9]. The extensions 
to symmetric functions are to be found in [12]. Theorem 8.14 is in [15]
and in somewhat extended form in [16]. Actually it is easy to show 
that (8.24) implies (8.20). The result (8.21) appeared first in [20]. 
Later a new technique and a generalization of (8.21) appeared in [8]. 
In [16] some extensions of the results of [8] are obtained via (8.20) and 
the results in [11]. 
Sec. 8.35. Theorem 8.15 was first proved for $c_i = 1$ in [23]. The
form given here is found in [14]. The remark (8.23) is found in [23] 
also, and there certain inequalities for Schur-convex functions of the 
eigenvalues are obtained. The result (8.24) is found in [1]. K. Fan 
obtained the same result for a more restrictive class of functions before 
the appearance of [1]. In a thesis at the University of British Columbia,
R. Thompson independently obtained (8.24) for $\varphi$ an elementary
symmetric function. This work, however, was not submitted for pub- 
lication after the appearance of [1]. The remark (8.25) is found in [17]. 
9.1 Introduction


We consider, throughout, methods for the numerical solution of the 
initial-value problem for systems of real ordinary differential equations 
of first order, 


$$\frac{dy^{\mu}}{dx} = f^{\mu}(x, y^{1}, \dots, y^{m}), \qquad y^{\mu}(x_0) = y_0^{\mu} \qquad (\mu = 1, 2, \dots, m).$$


Our aim is to give the fundamental principles on which most of these 
methods are based, rather than to attempt a complete presentation of all 
such methods. For a more detailed treatment and a discussion of 
methods for the numerical solution of boundary-value and eigenvalue 
problems, we refer to the books listed in the bibliography—for example, 
to those of Collatz, where further references to the ample literature on 
the subject can be found. 
In the following we always write the basic system in vector form,*


$$\frac{dy}{dx} = f(x,y), \qquad y(x_0) = y_0, \tag{9.1}$$

* To attain more symmetry, some authors write (9.1) as
$$\frac{dw}{dx} = g(w), \qquad w(x_0) = w_0,$$
by setting $w = \binom{x}{y}$, $g = \binom{1}{f}$, $w_0 = \binom{x_0}{y_0}$. This artifice appears to go back to Nyström ([46], p. 9).
and denote by $\|y\|$ the usual euclidean length of $y$. Without further mention, we tacitly assume that there is a set $I \times R$ containing $(x_0,y_0)$ in its interior on which $f(x,y)$ is continuous and satisfies


$$\|f(x,u) - f(x,v)\| \le K\,\|u - v\| \tag{9.2}$$


for some constant $K > 0$, so that there exists on some interval $[x_0,X] \subset I$ a unique solution $y(x)$ with $y(x_0) = y_0$. Recall that, if $f(x,y)$ is of class $C^s$, $s \ge 1$, on $I \times R$, then $y(x)$ is of class $C^{s+1}$ on $[x_0,X]$ and, if $R$ is convex and $\|\partial f(x,y)/\partial y^{\mu}\|$ is bounded on $I \times R$ for each $\mu = 1, 2, \dots, m$, then (9.2) holds.

All methods for the numerical solution of (9.1) considered in the following yield, for a sequence $\{x_n\}$ of abscissas $x_n \ge x_0$, a sequence $\{y_n\}$ of vectors $y_n$ which approximate the (exact) vectors $y(x_n)$ of the desired solution. In the first part, Secs. 9.2 to 9.5, we consider the so-called one-step methods, which define $y_{n+1}$ as a function of $x_{n+1}$, $x_n$, $y_n$. A famous representative of this group of methods is Kutta's generalization of Simpson's rule. The second part, Secs. 9.6 to 9.10, is devoted to multistep methods, in which $y_{n+1}$ is defined as a function of $x_{n+1}$, $x_{n-\rho}$, $y_{n-\rho}$, $\rho = 0, 1, \dots, k$ for some $k > 0$. Here, the classical example is Adams' method.

The kind of difficulty one is likely to face is illustrated by the simple equation,


$$\frac{dy}{dx} = \begin{pmatrix} 0 & 1 \\ a^2 & 0 \end{pmatrix} y, \qquad y(0) = \begin{pmatrix} 1 \\ -a \end{pmatrix}, \tag{9.3}$$


in which $a > 0$ is some constant. Its general solution is


$$y(x) = c_1 e^{-ax} \begin{pmatrix} 1 \\ -a \end{pmatrix} + c_2 e^{ax} \begin{pmatrix} 1 \\ a \end{pmatrix},$$


and the particular solution desired is obtained for $c_1 = 1$, $c_2 = 0$. No matter what approximate method is used, it will introduce a $c_2$-component due to round-off errors, and if at $x = \xi$ this component is $\eta$, it will be $\eta e^{2a}$ at $x = \xi + 2$. Thus, it will ultimately totally overshadow the desired solution and lead to entirely spurious results, regardless of how many decimals are carried in the computation.
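
The growth of the parasitic component can be observed directly. The following is a minimal sketch, not part of the original text: it integrates (9.3) with the classical fourth-order Runge-Kutta rule of Sec. 9.3, and the values $a = 5$, $h = 0.01$ and the function names are illustrative assumptions.

```python
import math

def rk4_step(f, x, y, h):
    """One step of the classical fourth-order Runge-Kutta rule; y is a list."""
    k1 = f(x, y)
    k2 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(x + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def integrate_example(a, h, n_steps):
    """Integrate y' = [[0, 1], [a^2, 0]] y with y(0) = (1, -a); return (x, y)."""
    f = lambda x, y: [y[1], a * a * y[0]]
    x, y = 0.0, [1.0, -a]
    for _ in range(n_steps):
        y = rk4_step(f, x, y, h)
        x += h
    return x, y

if __name__ == "__main__":
    a = 5.0  # assumed value of the constant a
    for n in (100, 400, 1000):
        x, y = integrate_example(a, 0.01, n)
        # Desired solution: first component exp(-a x).
        print(f"x = {x:5.2f}  computed = {y[0]: .3e}  exact = {math.exp(-a * x): .3e}")
```

For small $x$ the computed and exact values agree closely; for larger $x$ the round-off-induced $e^{ax}$ component dominates, whatever step size is chosen.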

Though in a particular problem the situation may not be so bad, this 
example should serve nevertheless as a warning: numerical integration 
should not be undertaken when no definite information about the desired 
solution is available. Careful analysis of the problem at hand must 
always precede the start of computations. 
RUNGE-KUTTA METHODS


9.2 Method of Taylor’s Series and General Runge-Kutta 
Method 


Every one-step method can be written in the form

$$y_{n+1} = y_n + h_n \varphi(x_n, y_n; h_n) \qquad (n = 0, 1, 2, \dots), \tag{9.4}$$

where $\varphi(x,y;h)$ is a vector-valued function and $h_n = x_{n+1} - x_n$. The choice of $\varphi$ should be "reasonable" in the sense that for fixed $(x,y) \in I \times R$

$$\varphi(x,y;h) \to f(x,y) \quad \text{as } h \to 0. \tag{9.5}$$

If $y(x)$ is the exact solution of the differential equation and

$$r(x,h) = y(x) + h\varphi(x,y(x);h) - y(x+h), \tag{9.6}$$

then for any fixed $x \in [x_0,X]$

$$r(x,h) = o(h) \quad \text{as } h \to 0.$$

The vector $r(x,h)$ is called the truncation error at the point $x$. If $p$ is the largest integer $p'$ with the property that

$$r(x,h) = O(h^{p'+1}) \quad \text{as } h \to 0, \tag{9.7}$$

then $p$ is called the order of the method (9.4).
The simplest choice of $\varphi$ satisfying (9.5) is

$$\varphi(x,y;h) = f(x,y).$$

The corresponding method,*

$$y_{n+1} = y_n + h f(x_n,y_n), \tag{9.8}$$

was proposed by Euler in 1768. If $f \in C^1$, it has the order $p = 1$, since

$$r(x,h) = y(x) + h f[x,y(x)] - y(x+h) = y(x) + h y'(x) - y(x+h) = O(h^2).$$
This also shows that the method, in essence, makes use of the first two 
terms of Taylor’s series. 
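
As an illustration of the order of Euler's formula (9.8), the sketch below (our own, with the test problem $y' = -y$, $y(0) = 1$ chosen as an assumption) halves the step and measures how the error at a fixed abscissa shrinks. A local error $O(h^2)$ accumulated over $O(1/h)$ steps should give an observed exponent near 1.

```python
import math

def euler(f, x0, y0, h, n_steps):
    """Euler's point-slope formula (9.8) for a scalar equation."""
    x, y = x0, y0
    for _ in range(n_steps):
        y = y + h * f(x, y)
        x = x + h
    return y

def observed_order(method, f, exact, x0, y0, X, n):
    """Estimate the exponent q in error ~ C h^q from runs with n and 2n steps."""
    e1 = abs(method(f, x0, y0, (X - x0) / n, n) - exact)
    e2 = abs(method(f, x0, y0, (X - x0) / (2 * n), 2 * n) - exact)
    return math.log2(e1 / e2)

if __name__ == "__main__":
    f = lambda x, y: -y  # assumed test equation y' = -y, y(0) = 1
    print(observed_order(euler, f, math.exp(-1.0), 0.0, 1.0, 1.0, 200))
```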

A natural extension of (9.8), then, is the so-called method of Taylor's series (also proposed by Euler), which takes into account the first $(p + 1)$ terms of Taylor's series. To describe it, let $f \in C^p$, $p \ge 1$, and set


$$f^{[0]}(x,y) = f(x,y), \qquad f^{[k+1]}(x,y) = \frac{\partial f^{[k]}(x,y)}{\partial x} + \frac{\partial f^{[k]}(x,y)}{\partial y}\, f(x,y) \qquad (k = 0, 1, \dots, p-2), \tag{9.9}$$


* In this and in similar formulas in the sequel, we write simply $h$ instead of $h_n$, with the understanding that $h$ may or may not depend on $n$.

where $\partial f^{[k]}/\partial y$ denotes the $m \times m$ matrix having as columns $\partial f^{[k]}/\partial y^{\mu}$ $(\mu = 1, 2, \dots, m)$. Clearly,


$$f^{[k]}(x,y(x)) = y^{(k+1)}(x). \tag{9.10}$$

Taking, then,


$$\varphi(x,y;h) = \sum_{k=0}^{p-1} \frac{h^k}{(k+1)!}\, f^{[k]}(x,y),$$

we are led to a method of order $p$, since, by (9.6), (9.10),


$$r(x,h) = y(x) + h \sum_{k=0}^{p-1} \frac{h^k}{(k+1)!}\, y^{(k+1)}(x) - y(x+h) = O(h^{p+1}).$$


Although for special systems of differential equations, notably linear 
systems, the method of Taylor’s series may be used quite efficiently (for 
an example, see Chap. 2), its applicability in more general cases is rather 
limited, because of the rapidly increasing complexity of the expressions 
in (9.9). It was Runge [52] who, in 1895, first pointed out a possibility 
of evading successive differentiations and of preserving at the same time 
the increased accuracy afforded by Taylor’s series. Runge’s method 
was subsequently improved by Heun [26] and Kutta [32]. 

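Kutta's proposal (somewhat more general than a similar one made by Heun) consists in setting up $\varphi$ with undetermined parameters as follows:


$$\varphi(x,y;h) = \sum_{s=1}^{r} \alpha_s k_s,$$
$$k_1(x,y) = f(x,y), \tag{9.11}$$
$$k_s(x,y;h) = f\Bigl(x + \mu_s h,\; y + h \sum_{j=1}^{s-1} \lambda_{sj} k_j\Bigr) \qquad (s = 2, 3, \dots, r).$$


Given $r$, the number of "substitutions" into $f$, the parameters $\alpha_s$, $\mu_s$, $\lambda_{sj}$ are to be determined so as to make the order $p$ of (9.11) as large as possible. We note that in (9.11) $f \in C^s$ implies $\varphi \in C^s$ for $h$ sufficiently small. Expanding the local truncation error formally into Taylor's series,


$$r(x,h) = \sum_{k=0}^{\infty} \frac{1}{k!} \left[\frac{d^k r(x,h)}{dh^k}\right]_{h=0} h^k,$$


and noting from (9.6) that $r(x,0) = 0$ and that


$$\left[\frac{d^k}{dh^k}\bigl(h\varphi\bigr)\right]_{h=0} = k \left[\frac{\partial^{k-1}\varphi}{\partial h^{k-1}}\right]_{h=0} \qquad (k > 0),$$


we find that Kutta's method is of order $p$, if $f \in C^p$ and


$$k \left[\frac{\partial^{k-1}\varphi(x,y(x);h)}{\partial h^{k-1}}\right]_{h=0} - y^{(k)}(x) \;\begin{cases} = 0 & \text{for } 1 \le k \le p, \\ \ne 0 & \text{for } k = p+1. \end{cases} \tag{9.12}$$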
More detailed calculations show that the $p$ identities in (9.12) are equivalent to a set of, in general, nonlinear equations for the parameters $\alpha_s$, $\mu_s$, $\lambda_{sj}$. For each $r$, there will be a largest value of $p$, $p = p^*(r)$, for which these equations are solvable, and it turns out that


$$p^*(r) = r \quad \text{if } 1 \le r \le 4.$$


The corresponding solutions have a certain degree of freedom, which is 1 for $r = 2$ and 2 for $r = 3$ and $r = 4$. Their actual derivation, however, is of considerable length, and the calculations become rapidly more complex as $r$ is further increased. For example, if $r = 5$, Kutta [32] obtains 16 equations in 15 unknowns, and it appears as yet uncertain whether these equations are dependent. Thus, $p^*(5) \ge 4$; similarly one knows only that $p^*(6) \ge 5$. Corresponding formulas of order 4 and 5 are listed in [46]. Formulas of order 6 utilizing eight substitutions are derived in [28, 29].


9.3. Examples of Runge-Kutta Formulas 


To derive a particular set of Runge-Kutta formulas for a fixed $r \ge 1$, one proceeds as follows. First one obtains, in terms of the partial derivatives of $f(x,y)$, the partial derivatives of $\varphi(x,y;h)$ with respect to $h$, by using (9.11), and the (total) derivatives of $y(x)$, by using (9.1); this requires $(p - 1)$ differentiations if a method of order $p$ is desired. Then one substitutes these expressions into (9.12) and satisfies the $p$ identities in (9.12) by equating to zero the coefficients of the various partial derivatives of $f(x,y)$. Under the natural assumption that


$$\mu_s = \sum_{j=1}^{s-1} \lambda_{sj} \qquad (s = 2, 3, \dots, r), \tag{9.13}$$


one finds, for example, for $p = 1$, 2, or 3, that the first, the first two, or all three of the following equations must be satisfied:


$$\sum_{s=1}^{r} \alpha_s = 1, \tag{9.14}$$
$$\sum_{s=2}^{r} \alpha_s \mu_s = \tfrac12, \tag{9.15}$$
$$\sum_{s=2}^{r} \alpha_s \mu_s^2 = \tfrac13, \qquad \sum_{s=3}^{r} \alpha_s \sum_{t=2}^{s-1} \lambda_{st}\mu_t = \tfrac16. \tag{9.16}$$


Suppose we require only a method of order $p \ge 1$, so that solely (9.14) must hold. If $r = 1$, then (9.13) drops out, and $\alpha_1 = 1$, which yields Euler's point-slope formula (9.8). If $r > 1$, we have to satisfy $r$ equations with $2r - 1 + \tfrac12 r(r-1)$ variables, which leaves us $r - 1 + \tfrac12 r(r-1)$ free parameters.
Similarly, for a Runge-Kutta method of order $p \ge 2$, both (9.14) and (9.15) must hold, which requires $r \ge 2$. There are $r - 2 + \tfrac12 r(r-1)$ degrees of freedom in the choice of the parameters. For example, if $r = 2$, we have


$$\alpha_1 = 1 - \alpha_2, \qquad \mu_2 = \lambda_{21} = \frac{1}{2\alpha_2}.$$


Note that $r = 2$ violates the second relation in (9.16) so that $p^*(2) = 2$. In the modified Euler-Cauchy method (or the improved point-slope formula) the choice $\alpha_2 = 1$ is made, for which (9.4) reads


$$y_{n+1} = y_n + h f[x_n + \tfrac12 h,\; y_n + \tfrac12 h f(x_n,y_n)]. \tag{9.17}$$


In Heun's method (or the improved Euler-Cauchy method) $\alpha_2 = \tfrac12$ is chosen, for which (9.4) reduces to


$$y_{n+1} = y_n + \tfrac12 h\,\{f(x_n,y_n) + f[x_n + h,\; y_n + h f(x_n,y_n)]\}. \tag{9.18}$$


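Runge-Kutta formulas of order $p \ge 3$ can be obtained if $r \ge 3$; they can be chosen from a set of formulas with degree of freedom $[r - 4 + \tfrac12 r(r-1)]$. If $r = 3$, they are given by


$$\alpha_1 = \frac{2 - 3(\mu_2 + \mu_3) + 6\mu_2\mu_3}{6\mu_2\mu_3}, \qquad \alpha_2 = \frac{2 - 3\mu_3}{6\mu_2(\mu_2 - \mu_3)}, \qquad \alpha_3 = \frac{2 - 3\mu_2}{6\mu_3(\mu_3 - \mu_2)},$$
$$\lambda_{21} = \mu_2, \qquad \lambda_{31} = \frac{\mu_3(3\mu_2 - 3\mu_2^2 - \mu_3)}{\mu_2(2 - 3\mu_2)}, \qquad \lambda_{32} = \frac{\mu_3(\mu_3 - \mu_2)}{\mu_2(2 - 3\mu_2)},$$


as follows from (9.13) to (9.16). If $\mu_2 = \tfrac13$, $\mu_3 = \tfrac23$ is chosen, then (9.4) becomes

$$y_{n+1} = y_n + \tfrac14 h(k_1 + 3k_3),$$
$$k_1 = f(x_n,y_n), \tag{9.19}$$
$$k_3 = f\{x_n + \tfrac23 h,\; y_n + \tfrac23 h f[x_n + \tfrac13 h,\; y_n + \tfrac13 h f(x_n,y_n)]\},$$

which is also referred to as Heun's formula. The choice $\mu_2 = \tfrac12$, $\mu_3 = 1$ yields Kutta's third-order rule:

$$y_{n+1} = y_n + \tfrac16 h(k_1 + 4k_2 + k_3),$$
$$k_1 = f(x_n,y_n), \tag{9.20}$$
$$k_2 = f[x_n + \tfrac12 h,\; y_n + \tfrac12 h f(x_n,y_n)],$$
$$k_3 = f\{x_n + h,\; y_n - h f(x_n,y_n) + 2h f[x_n + \tfrac12 h,\; y_n + \tfrac12 h f(x_n,y_n)]\}.$$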
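
The two-parameter family for $r = 3$ can be checked in exact rational arithmetic. The sketch below is an illustration rather than part of the original derivation: it solves (9.13) to (9.16) for given $\mu_2$, $\mu_3$ (assumed nonzero, distinct, and with $\mu_2 \ne \tfrac23$) and verifies the order conditions. The function names are hypothetical.

```python
from fractions import Fraction as F

def rk3_parameters(mu2, mu3):
    """Parameters of the three-substitution, third-order family for given mu2, mu3.

    Assumes mu2 and mu3 are nonzero, mu2 != mu3 and mu2 != 2/3.
    """
    mu2, mu3 = F(mu2), F(mu3)
    a2 = (2 - 3 * mu3) / (6 * mu2 * (mu2 - mu3))
    a3 = (2 - 3 * mu2) / (6 * mu3 * (mu3 - mu2))
    a1 = 1 - a2 - a3                      # from (9.14)
    lam32 = mu3 * (mu3 - mu2) / (mu2 * (2 - 3 * mu2))
    lam31 = mu3 - lam32                   # from (9.13)
    lam21 = mu2                           # from (9.13)
    return {"alpha": (a1, a2, a3), "mu": (mu2, mu3),
            "lam21": lam21, "lam31": lam31, "lam32": lam32}

def satisfies_order3(p):
    """Check (9.14), (9.15) and both relations in (9.16) exactly."""
    a1, a2, a3 = p["alpha"]
    mu2, mu3 = p["mu"]
    return (a1 + a2 + a3 == 1
            and a2 * mu2 + a3 * mu3 == F(1, 2)
            and a2 * mu2 ** 2 + a3 * mu3 ** 2 == F(1, 3)
            and a3 * p["lam32"] * mu2 == F(1, 6))

if __name__ == "__main__":
    heun = rk3_parameters(F(1, 3), F(2, 3))   # formula (9.19)
    kutta = rk3_parameters(F(1, 2), F(1))     # formula (9.20)
    print(heun["alpha"], satisfies_order3(heun))
    print(kutta["alpha"], kutta["lam31"], kutta["lam32"], satisfies_order3(kutta))
```

With $\mu_2 = \tfrac13$, $\mu_3 = \tfrac23$ the weights come out as $(\tfrac14, 0, \tfrac34)$, and with $\mu_2 = \tfrac12$, $\mu_3 = 1$ as $(\tfrac16, \tfrac23, \tfrac16)$ with $\lambda_{31} = -1$, $\lambda_{32} = 2$, in agreement with (9.19) and (9.20).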
To find Runge-Kutta formulas of order $p = 4$, we must have $r \ge 4$. If $r = 4$, there are obtained 11 equations containing 13 parameters. A particular solution suggested by Kutta is the set of constants


$$\alpha_1 = \alpha_4 = \tfrac16, \qquad \alpha_2 = \alpha_3 = \tfrac13,$$
$$\mu_2 = \mu_3 = \tfrac12, \qquad \mu_4 = 1,$$
$$\lambda_{21} = \lambda_{32} = \tfrac12, \qquad \lambda_{43} = 1, \qquad \lambda_{31} = \lambda_{41} = \lambda_{42} = 0.$$


This is the choice made in Runge-Kutta's method, for which (9.4) reads


$$y_{n+1} = y_n + \tfrac16 h(k_1 + 2k_2 + 2k_3 + k_4),$$
$$k_1 = f(x_n,y_n),$$
$$k_2 = f(x_n + \tfrac12 h,\; y_n + \tfrac12 hk_1), \tag{9.21}$$
$$k_3 = f(x_n + \tfrac12 h,\; y_n + \tfrac12 hk_2),$$
$$k_4 = f(x_n + h,\; y_n + hk_3).$$

Note that, if $f$ is independent of $y$, both formulas (9.20) and (9.21) reduce to Simpson's rule.
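
Formula (9.21) translates almost line by line into code. The following is a minimal sketch for a scalar equation; the function name and the test problems are our own assumptions. It can be used to observe the fourth-order behavior and the reduction to Simpson's rule when $f$ does not depend on $y$.

```python
import math

def rk4(f, x0, y0, h, n_steps):
    """Runge-Kutta's method (9.21) for the scalar problem y' = f(x, y)."""
    x, y = x0, y0
    for _ in range(n_steps):
        k1 = f(x, y)
        k2 = f(x + h / 2, y + h / 2 * k1)
        k3 = f(x + h / 2, y + h / 2 * k2)
        k4 = f(x + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        x += h
    return y

if __name__ == "__main__":
    f = lambda x, y: y  # assumed test equation y' = y, y(0) = 1
    e1 = abs(rk4(f, 0.0, 1.0, 0.1, 10) - math.e)
    e2 = abs(rk4(f, 0.0, 1.0, 0.05, 20) - math.e)
    print("observed order:", math.log2(e1 / e2))

    # With f independent of y, one step is Simpson's rule on [0, h].
    h = 0.3
    print(rk4(lambda x, y: math.cos(x), 0.0, 0.0, h, 1),
          h / 6 * (math.cos(0.0) + 4 * math.cos(h / 2) + math.cos(h)))
```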

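Gill [19] proposes a solution for the same 11 equations under the additional requirement that the vectors $y_n + h(\lambda_{i1}k_1 + \lambda_{i2}k_2)$, $i = 3, 4$, and $y_n + h(\alpha_1k_1 + \alpha_2k_2)$ be linearly dependent. He adopts, from among others, the set of constants


$$\alpha_1 = \alpha_4 = \tfrac16, \qquad \alpha_2 = \tfrac13(1 - \sqrt{\tfrac12}), \qquad \alpha_3 = \tfrac13(1 + \sqrt{\tfrac12}),$$
$$\mu_2 = \mu_3 = \tfrac12, \qquad \mu_4 = 1,$$
$$\lambda_{21} = \tfrac12, \qquad \lambda_{31} = -\tfrac12 + \sqrt{\tfrac12}, \qquad \lambda_{32} = 1 - \sqrt{\tfrac12},$$
$$\lambda_{41} = 0, \qquad \lambda_{42} = -\sqrt{\tfrac12}, \qquad \lambda_{43} = 1 + \sqrt{\tfrac12},$$


with which (9.4) becomes


$$y_{n+1} = y_n + \tfrac16 h[k_1 + 2(1 - \sqrt{\tfrac12})k_2 + 2(1 + \sqrt{\tfrac12})k_3 + k_4],$$
$$k_1 = f(x_n,y_n),$$
$$k_2 = f(x_n + \tfrac12 h,\; y_n + \tfrac12 hk_1), \tag{9.22}$$
$$k_3 = f(x_n + \tfrac12 h,\; y_n - (\tfrac12 - \sqrt{\tfrac12})hk_1 + (1 - \sqrt{\tfrac12})hk_2),$$
$$k_4 = f(x_n + h,\; y_n - \sqrt{\tfrac12}\,hk_2 + (1 + \sqrt{\tfrac12})hk_3).$$


Still other formulas are described in [31] and in [13, 38], where also their adoption for use on high-speed computing machines is discussed.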
More accurate variants of Runge-Kutta's method, involving also substitutions
into partial derivatives of $f$, are given in [17, 36, 47]. Further
one-step methods with 


¢ (x, 934) = Lasts + phy + BAG (x I3Hrh, «++ Meh), 
where the ¢, represent auxiliary one-step methods, are considered in [23, 
25]. 
9.4 Differential Equations of Higher Order 


Because of their importance we consider in more detail systems of second-order differential equations


$$v'' = g(x,v,v'), \qquad v(x_0) = v_0, \qquad v'(x_0) = v_0', \tag{9.23}$$


where $v$ and $g$ are vector-valued functions and $g$ is sufficiently smooth on a suitable set $I \times R \times S$.

The system (9.23) can be written as a first-order system [of twice the size of (9.23)] if we introduce the column vector $w = v'$ and write


$$y = \binom{v}{w}, \qquad f(x,y) = \binom{w}{g(x,v,w)}, \qquad y_0 = \binom{v_0}{v_0'}. \tag{9.24}$$

Then (9.23) is equivalent to


$$\frac{dy}{dx} = f(x,y), \qquad y(x_0) = y_0. \tag{9.25}$$


Any method of Sec. 9.3 can now be applied to the system (9.25). The resulting formulas may then again be separated into their $v$ and $w$ parts to obtain formulas for the computation of $v$ and $w = v'$.

For the Runge-Kutta formula (9.21) we find, after a short calculation,


$$v_{n+1} = v_n + hv_n' + \tfrac16 h^2(l_1 + l_2 + l_3),$$
$$v_{n+1}' = v_n' + \tfrac16 h(l_1 + 2l_2 + 2l_3 + l_4), \tag{9.26}$$

where

$$l_1 = g(x_n, v_n, v_n'),$$
$$l_2 = g(x_n + \tfrac12 h,\; v_n + \tfrac12 hv_n',\; v_n' + \tfrac12 hl_1), \tag{9.27}$$
$$l_3 = g(x_n + \tfrac12 h,\; v_n + \tfrac12 hv_n' + \tfrac14 h^2l_1,\; v_n' + \tfrac12 hl_2),$$
$$l_4 = g(x_n + h,\; v_n + hv_n' + \tfrac12 h^2l_2,\; v_n' + hl_3).$$

By construction, this method is of order 4, in the sense that

$$v_1 - v(x_0 + h) = O(h^5), \qquad v_1' - v'(x_0 + h) = O(h^5). \tag{9.28}$$
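
Formulas (9.26), (9.27) can be coded without forming the first-order system explicitly. The sketch below treats a scalar equation; the function name and the test problem $v'' = -v$, $v(0) = 0$, $v'(0) = 1$ (solution $\sin x$) are illustrative assumptions.

```python
import math

def rk4_second_order(g, x0, v0, dv0, h, n_steps):
    """Apply (9.26), (9.27) to the scalar equation v'' = g(x, v, v')."""
    x, v, dv = x0, v0, dv0
    for _ in range(n_steps):
        l1 = g(x, v, dv)
        l2 = g(x + h / 2, v + h / 2 * dv, dv + h / 2 * l1)
        l3 = g(x + h / 2, v + h / 2 * dv + h * h / 4 * l1, dv + h / 2 * l2)
        l4 = g(x + h, v + h * dv + h * h / 2 * l2, dv + h * l3)
        # Both updates use the old values of v and dv.
        v, dv = (v + h * dv + h * h / 6 * (l1 + l2 + l3),
                 dv + h / 6 * (l1 + 2 * l2 + 2 * l3 + l4))
        x += h
    return v, dv

if __name__ == "__main__":
    v, dv = rk4_second_order(lambda x, v, dv: -v, 0.0, 0.0, 1.0, 0.01, 100)
    print(v - math.sin(1.0), dv - math.cos(1.0))
```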
In analogy with Kutta's procedure [see (9.11)], one could start from the outset with a system of formulas of type (9.26), (9.27) and use undetermined coefficients to search for more accurate or more convenient formulas. This was done in 1925 by Nyström [46], who sets


$$v_{n+1} = v_n + hv_n' + h^2 \sum_{s=1}^{r} \alpha_s l_s,$$
$$v_{n+1}' = v_n' + h \sum_{s=1}^{r} \beta_s l_s, \tag{9.29}$$


with

$$l_1 = g(x_n, v_n, v_n'),$$
$$l_s = g\Bigl(x_n + \mu_s h,\; v_n + \mu_s hv_n' + h^2 \sum_{j=1}^{s-1} \lambda_{sj}l_j,\; v_n' + h \sum_{j=1}^{s-1} \kappa_{sj}l_j\Bigr) \qquad (s = 2, 3, \dots, r). \tag{9.30}$$


He finds, among others, the following particularly simple set of formulas:


$$l_1 = g(x_n, v_n, v_n'),$$
$$l_2 = g(x_n + \tfrac12 h,\; v_n + \tfrac12 hv_n' + \tfrac18 h^2l_1,\; v_n' + \tfrac12 hl_1),$$
$$l_3 = g(x_n + \tfrac12 h,\; v_n + \tfrac12 hv_n' + \tfrac18 h^2l_1,\; v_n' + \tfrac12 hl_2), \tag{9.31}$$
$$l_4 = g(x_n + h,\; v_n + hv_n' + \tfrac12 h^2l_3,\; v_n' + hl_3),$$


from which $v_{n+1}$, $v_{n+1}'$ is obtained as in (9.26). The resulting method is also of order 4 but in general involves fewer calculations than (9.27), since the first two arguments in $l_2$ and $l_3$ are the same.


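A method which extends Nyström's method (9.26), (9.31) to systems of $m$th-order equations


$$v^{(m)} = g(x, v, v', \dots, v^{(m-1)}), \qquad v^{(\mu)}(x_0) = v_0^{(\mu)} \qquad (\mu = 0, 1, \dots, m-1) \tag{9.32}$$


was developed by Zurmühl [72, 73]. To describe it, let


$$T_\mu(\alpha) = v_n^{(\mu)} + \alpha h\,v_n^{(\mu+1)} + \cdots + \frac{(\alpha h)^{m-\mu-1}}{(m-\mu-1)!}\,v_n^{(m-1)} \qquad (0 \le \mu \le m-1)$$


denote truncated Taylor series for the $\mu$th derivative at $x = x_n + \alpha h$, based on the available derivatives at $x = x_n$. Using the short-hand notation $g(x,u_\mu)$ for $g(x, u_0, u_1, \dots, u_{m-1})$, set


$$l_1 = g[x_n,\; T_\mu(0)],$$
$$l_2 = g\Bigl[x_n + \tfrac12 h,\; T_\mu(\tfrac12) + \frac{(h/2)^{m-\mu}}{(m-\mu)!}\,l_1\Bigr],$$
$$l_3 = g\Bigl[x_n + \tfrac12 h,\; T_\mu(\tfrac12) + \begin{cases} \dfrac{(h/2)^{m-\mu}}{(m-\mu)!}\,l_1 & (0 \le \mu < m-1) \\[1ex] \tfrac12 h\,l_2 & (\mu = m-1) \end{cases}\Bigr], \tag{9.33}$$
$$l_4 = g\Bigl[x_n + h,\; T_\mu(1) + \frac{h^{m-\mu}}{(m-\mu)!}\,l_3\Bigr].$$

Then

$$v_{n+1}^{(\mu)} = T_\mu(1) + \frac{h^{m-\mu}}{(m-\mu+2)!}\,[(m-\mu)^2 l_1 + 2(m-\mu)(l_2 + l_3) + (2 - m + \mu)l_4] \qquad (\mu = 0, 1, \dots, m-1). \tag{9.34}$$

It was shown by Zurmühl that in case $v$ is a scalar function one has


$$v_1^{(\mu)} - v^{(\mu)}(x_0 + h) = O(h^p), \qquad p = \begin{cases} m + 3 - \mu & (0 \le \mu \le m-2), \\ 5 & (\mu = m-1). \end{cases} \tag{9.35}$$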


Thus, if $m > 2$, the function $v(x)$ is obtained with a local error of order $h^{m+3}$, and the order of accuracy in the sequence of derivatives decreases successively by 1 until it reaches $O(h^5)$ for the $(m - 2)$nd derivative. It must be noted, however, that the accumulated error of $v$ over a large number of steps is nevertheless of order $h^4$ for all $m$. Compare in this connection the remarks in [55].

Further Runge-Kutta methods for (9.32), involving also partial deriv- 
atives of g, are listed in [7, 17]. 


9.5 Error Analysis 
For any one-step method we have, from (9.4) and (9.6),


$$y_{n+1} = y_n + h\varphi(x_n,y_n;h),$$
$$y(x_{n+1}) = y(x_n) + h\varphi(x_n,y(x_n);h) - r(x_n,h),$$


where $r(x,h)$ is the local truncation error. Thus, if $\varepsilon_n = y_n - y(x_n)$ denotes the error vector at the $n$th step, we find by subtraction


$$\varepsilon_{n+1} = \varepsilon_n + h[\varphi(x_n,y_n;h) - \varphi(x_n,y(x_n);h)] + r(x_n,h). \tag{9.36}$$


This shows that the increase of the accumulated error is composed of two parts: the truncation error $r(x_n,h)$ and the contribution arising from the second term on the right. It is, in fact, the nature of this additional term which is decisive for the behavior of $\varepsilon_n$ as $n$ becomes large.
To see this, write

$$\varphi(x_n,y_n;h) - \varphi(x_n,y(x_n);h) = \frac{\partial\varphi}{\partial y}\,[y_n - y(x_n)] = \frac{\partial\varphi}{\partial y}\,\varepsilon_n,$$


where the derivatives making up the matrix $(\partial\varphi/\partial y)$ are taken at suitable points on the line segment between $y_n$ and $y(x_n)$. We obtain then, from (9.36),


$$\varepsilon_{n+1} = \Bigl(I + h\frac{\partial\varphi}{\partial y}\Bigr)\varepsilon_n + r(x_n,h). \tag{9.37}$$
The behavior of $\varepsilon_n$ for large $n$, therefore, depends mainly on the matrices $I + h(\partial\varphi/\partial y)$, which are all close to the unit matrix $I$. If they have eigenvalues which are consistently (i.e., for all $n$) larger than 1, then one expects the sequence of error norms $\|\varepsilon_n\|$ to increase ultimately like a geometric progression. If all eigenvalues are consistently less than 1, then the error norms will remain bounded. In view of this, the first term on the right in (9.37) is often called the "propagation error." Obviously, in case of simple quadratures, where $\partial\varphi/\partial y = 0$, there is no geometric propagation of errors.

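An estimation of the propagation error for Runge-Kutta methods was already given by Runge [53]. Suppose that,* in $I \times R$,

$$\left\|\frac{\partial f(x,y)}{\partial y}\right\| \le K,$$


and consider, for example, the Runge-Kutta method (9.21), with


$$\varphi(x,y;h) = \sum_{s=1}^{4} \alpha_s k_s(x,y),$$
$$k_1(x,y) = f(x,y), \tag{9.38}$$
$$k_s(x,y) = f(x + \mu_s h,\; y + \lambda_s hk_{s-1}) \qquad (s \ge 2),$$


$$\alpha_1 = \alpha_4 = \tfrac12\alpha_2 = \tfrac12\alpha_3 = \tfrac16, \qquad \mu_2 = \mu_3 = \tfrac12\mu_4 = \tfrac12, \qquad \lambda_2 = \lambda_3 = \tfrac12\lambda_4 = \tfrac12. \tag{9.39}$$


Then, for any two points $(x,y)$, $(x,\bar y)$ in $I \times R$, we have


$$\|k_1(x,y) - k_1(x,\bar y)\| = \left\|\frac{\partial f}{\partial y}(y - \bar y)\right\| \le K\|y - \bar y\|,$$
$$\|k_s(x,y) - k_s(x,\bar y)\| = \left\|\frac{\partial f}{\partial y}\{y - \bar y + \lambda_s h[k_{s-1}(x,y) - k_{s-1}(x,\bar y)]\}\right\| \le K[\|y - \bar y\| + \lambda_s h\|k_{s-1}(x,y) - k_{s-1}(x,\bar y)\|].$$

From this we obtain successively


$$\|k_2(x,y) - k_2(x,\bar y)\| \le K(1 + \lambda_2hK)\,\|y - \bar y\|,$$
$$\|k_3(x,y) - k_3(x,\bar y)\| \le K(1 + \lambda_3hK + \lambda_3\lambda_2h^2K^2)\,\|y - \bar y\|,$$
$$\|k_4(x,y) - k_4(x,\bar y)\| \le K(1 + \lambda_4hK + \lambda_4\lambda_3h^2K^2 + \lambda_4\lambda_3\lambda_2h^3K^3)\,\|y - \bar y\|.$$


* For a different assumption in this connection (relative to scalar differential equations) see [11].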
Therefore, by (9.38) and (9.39),

$$\|\varphi(x,y;h) - \varphi(x,\bar y;h)\| \le K[\alpha_1 + \alpha_2(1 + \lambda_2hK) + \alpha_3(1 + \lambda_3hK + \lambda_3\lambda_2h^2K^2) + \alpha_4(1 + \lambda_4hK + \lambda_4\lambda_3h^2K^2 + \lambda_4\lambda_3\lambda_2h^3K^3)]\,\|y - \bar y\|$$
$$= K(1 + \tfrac12 hK + \tfrac16 h^2K^2 + \tfrac1{24}h^3K^3)\,\|y - \bar y\|.$$


Letting now $x = x_n$, $y = y_n$, $\bar y = y(x_n)$, we find for the propagation term in (9.36) the following estimation


$$\|\varepsilon_n + h[\varphi(x_n,y_n;h) - \varphi(x_n,y(x_n);h)]\| \le \left(1 + hK + \frac{(hK)^2}{2!} + \frac{(hK)^3}{3!} + \frac{(hK)^4}{4!}\right)\|\varepsilon_n\|. \tag{9.40}$$


The estimation of the truncation error is considerably more laborious. A first rigorous estimate for the Runge-Kutta method (9.38), (9.39) is stated in [4] and proved in [5] by Bieberbach. We give it here for the case of a scalar first-order differential equation. In $I \times R$, let


$$|f(x,y)| \le N, \qquad \left|\frac{\partial^{i+k} f}{\partial x^i\,\partial y^k}\right| \le \frac{M}{N^{k-1}} \qquad (0 < i + k \le 4). \tag{9.41}$$

Then, for $x_n < X$,


$$|r(x_n,h)| \le MN(3.7 + 5.4M + 1.3M^2 + .017M^3)h^5. \tag{9.42}$$


Similar bounds are derived in [5] and [66] also for arbitrary systems of differential equations. Analogous bounds for the truncation error of Zurmühl's method can be found in [18] (see also [8]).

Under the slightly different assumptions


$$\left|\frac{\partial^{l} f}{\partial x^{l-k}\,\partial y^k}\right| \le \frac{M^l}{N^{k-1}} \qquad (0 \le k \le l,\; 0 \le l \le 4) \tag{9.43}$$


(which contain, for $l = 0$, the assumption $|f| \le N$) Lotkin [35] derives the "asymptotic" estimate


$$r(x_n,h) = \rho h^5 + O(h^6), \qquad |\rho| \le \tfrac{73}{720}NM^4. \tag{9.44}$$

The constant $M$ in (9.43) may be obtained by forming

$$\max_{I \times R}\left|\frac{\partial^{l} f}{\partial x^{l-k}\,\partial y^k}\right| = K_{lk}, \qquad \max_{0 \le k \le l} N^{k-1}K_{lk} = L_l, \qquad \max_{1 \le l \le 4} L_l^{1/l} = M$$


in this order.
Estimates which are based on a priori bounds for y®)(x), (d/dx) f,[x, 
3(x)], A,y[*,y(x)] are given in [1], and further estimates in [7, 10]. 
The estimation (9.40) of the propagation error can now be combined with any estimation


$$\|r(x_n,h)\| \le Th^5$$


of the truncation error to give a bound for the total accumulated error in Runge-Kutta's method. In fact, setting


$$P = K\left(1 + \frac{hK}{2!} + \frac{(hK)^2}{3!} + \frac{(hK)^3}{4!}\right),$$


we obtain, from (9.36) and (9.40),

$$\|\varepsilon_{n+1}\| \le (1 + hP)\|\varepsilon_n\| + Th^5 \qquad (n = 0, 1, 2, \dots),$$


and therefore, if $\varepsilon_0 = 0$ and $P > 0$,

$$\|\varepsilon_n\| \le \frac{Th^4}{P}\,[(1 + hP)^n - 1] \le \frac{Th^4}{P}\,[e^{nhP} - 1].$$


In the light of example (9.3) it is not surprising that estimations of this kind, being necessarily satisfied in every instance, may lead to rather conservative bounds. For example, if we apply (9.41), (9.42) to the equation


$$y' = f(x), \tag{9.45}$$

that is, to a simple problem of quadrature, we obtain


$$|r(x_n,h)| \le 3.7Hh^5(1 + \cdots), \qquad H = \max_{1 \le l \le 4} |f^{(l)}|, \tag{9.46}$$


where dots indicate additional positive terms. On the other hand, the well-known remainder term for Simpson's rule, to which Runge-Kutta's method reduces in this case, leads to the estimate


$$|r(x_n,h)| \le \frac{H^*h^5}{2880}, \qquad H^* = \max |f^{(4)}|. \tag{9.47}$$


Comparison between (9.46) and (9.47) shows that, for quadratures, Bieberbach's bound is too large by a factor of at least $10^4$. Somewhat better is the bound which derives from (9.44) [about 300 times too large in the case of (9.45)], but it has the drawback of being justified only for $h$ sufficiently small.

In view of these difficulties, present-day efforts tend to appraise the error more realistically by stochastic methods [25, 50]. In practice, one makes use of various asymptotic devices to estimate the accuracy, notably Richardson's "deferred approach to the limit" [51] (see also [16, 21]).
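
As an illustration of such an asymptotic device, the sketch below applies a Richardson-type estimate to Runge-Kutta's method (9.21): if the accumulated error behaves like $Ch^4$, then the results $y_h$ and $y_{h/2}$ obtained with steps $h$ and $h/2$ give $(y_h - y_{h/2})/15$ as an estimate of the error of $y_{h/2}$. The function names and the test problem are our own assumptions, and the estimate is only meaningful for $h$ small enough that the leading error term dominates.

```python
import math

def rk4_solve(f, x0, y0, X, n):
    """Integrate the scalar problem y' = f(x, y) from x0 to X with n steps of (9.21)."""
    h = (X - x0) / n
    x, y = x0, y0
    for _ in range(n):
        k1 = f(x, y)
        k2 = f(x + h / 2, y + h / 2 * k1)
        k3 = f(x + h / 2, y + h / 2 * k2)
        k4 = f(x + h, y + h * k3)
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        x += h
    return y

def richardson_error_estimate(f, x0, y0, X, n):
    """Return (y_half, estimate of y_half - y(X)) using steps h and h/2."""
    y_h = rk4_solve(f, x0, y0, X, n)
    y_half = rk4_solve(f, x0, y0, X, 2 * n)
    # 15 = 2**4 - 1 for a method whose accumulated error is O(h^4).
    return y_half, (y_h - y_half) / 15.0

if __name__ == "__main__":
    y_half, estimate = richardson_error_estimate(lambda x, y: y, 0.0, 1.0, 1.0, 10)
    print("estimated error:", estimate, " actual error:", y_half - math.e)
```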
DIFFERENCE METHODS
9.6 Linear Multistep Methods


The methods so far discussed replace the first-order system of differential equations (9.1) by a first-order system of difference equations (9.4). The local error committed thereby could be made as small as $O(h^5)$, but not smaller without considerable additional effort. We now replace (9.1) by difference equations of higher order, which will permit us to decrease the local error to any desired order of magnitude.

More specifically, we consider linear multistep methods, that is, methods which define $y_{n+1}$ by a linear combination of vectors $y_{n-\rho+1}$, $hf_{n-\rho+1}$, $\rho = 1, 2, \dots, k$, where


$$f_n = f(x_n,y_n) \qquad (n = 0, 1, 2, \dots). \tag{9.48}$$


We write these methods in the form


$$y_{n+1} + \alpha_1y_n + \cdots + \alpha_ky_{n-k+1} = h(\beta_0f_{n+1} + \beta_1f_n + \cdots + \beta_kf_{n-k+1}), \tag{9.49}$$


where it is assumed that $\alpha_s$, $\beta_s$ are real constants, $k \ge 1$, $|\alpha_k| + |\beta_k| > 0$, and $h = x_{n+1} - x_n$ is independent of $n$. Once $k$ initial vectors $y_0, y_1, \dots, y_{k-1}$ are known, the relation (9.49) can be used to obtain successively all desired approximations $y_n$ for $n \ge k$. The $k$ initial vectors must be found independently by some other method (e.g., by a Runge-Kutta method).

It is customary to call the formula (9.49) open if $\beta_0 = 0$ and closed if $\beta_0 \ne 0$. Open formulas define the "new" approximation $y_{n+1}$ explicitly in terms of "old" approximations, whereas closed formulas contain $y_{n+1}$ implicitly as argument in $f_{n+1}$. In the latter case the relation (9.49) represents a system of $m$, in general, nonlinear equations for the $m$ components of $y_{n+1}$. The existence and uniqueness of a solution of such a system, and a practical method of solving it, are discussed in Sec. 9.8.

We associate with the "$k$-step" method (9.49) the linear functional $L$ defined by


$$Lw(t) = \sum_{s=0}^{k} [\alpha_s w(k - s) - \beta_s w'(k - s)], \qquad \alpha_0 = 1. \tag{9.50}$$

Here $w(t)$ is considered to be a scalar function differentiable in $[0,k]$.


We call $k$ the index of $L$ and say that the functional $L$ is of order $p$ if $L(t^r) = 0$ for $r = 0, 1, \dots, p$ but $L(t^{p+1}) \ne 0$.*
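
The definition of order can be applied mechanically. The sketch below is an illustration with hypothetical function names: it evaluates $L(t^r)$ from (9.50) in exact rational arithmetic and returns the order $p$ together with the quantity $L(t^{p+1})/(p+1)!$. The coefficient lists are indexed $s = 0, 1, \dots, k$.

```python
from fractions import Fraction as F
from math import factorial

def L_of_power(alpha, beta, r):
    """L(t^r) = sum_s [alpha_s (k-s)^r - r beta_s (k-s)^(r-1)], from (9.50)."""
    k = len(alpha) - 1
    total = F(0)
    for s in range(k + 1):
        t = k - s
        total += F(alpha[s]) * t ** r
        if r > 0:
            total -= r * F(beta[s]) * t ** (r - 1)   # note 0**0 == 1 in Python
    return total

def order_and_constant(alpha, beta, r_max=20):
    """Return (p, L(t^(p+1)) / (p+1)!) for the given coefficients."""
    for r in range(r_max + 1):
        value = L_of_power(alpha, beta, r)
        if value != 0:
            return r - 1, value / factorial(r)
    raise ValueError("order exceeds r_max")

if __name__ == "__main__":
    # Euler's formula: y_{n+1} - y_n = h f_n.
    print(order_and_constant([1, -1], [0, 1]))
    # Simpson's formula: y_{n+1} - y_{n-1} = (h/3)(f_{n+1} + 4 f_n + f_{n-1}).
    print(order_and_constant([1, 0, -1], [F(1, 3), F(4, 3), F(1, 3)]))
```

For Euler's formula this gives order 1, and for Simpson's formula order 4 with constant $-\tfrac{1}{90}$.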


* In place of the powers $\{t^r\}$ one may consider other systems of functions $\{w_r(t)\}$ and define analogously the order of $L$ with respect to $\{w_r\}$. For the case $w_r(t) = \exp(\lambda_r t)$ with suitable constants $\lambda_r$, see [6].
If $w \in C^{p+1}$ in $[0,k]$, then, by Taylor's theorem,

$$w(t) = \sum_{r=0}^{p} \frac{w^{(r)}(0)}{r!}\,t^r + \frac{1}{p!}\int_0^k (t - u)_+^p\, w^{(p+1)}(u)\,du,$$


where $(t - u)_+ = \max(0,\, t - u)$.


Every linear functional $L$ of order $p$ can therefore be represented in the form


$$Lw = \frac{1}{p!}\int_0^k \Lambda_p(u)\,w^{(p+1)}(u)\,du, \qquad \Lambda_p(u) = L[(t - u)_+^p], \tag{9.51}$$


in which all properties of $L$ are collected in the first factor of the integrand and all those of $w$ in the second factor. If $\Lambda_p(u)$ does not change sign in $[0,k]$, then (9.51) can be simplified to

$$Lw = l_{p+1}\,w^{(p+1)}(\tau), \qquad l_{p+1} = \frac{L(t^{p+1})}{(p+1)!}, \qquad 0 < \tau < k. \tag{9.51'}$$

In fact, by the mean-value theorem,

$$Lw = \frac{w^{(p+1)}(\tau)}{p!}\int_0^k \Lambda_p(u)\,du$$

and by (9.51), with $w(t) = t^{p+1}/(p+1)!$,


$$\frac{1}{p!}\int_0^k \Lambda_p(u)\,du = \frac{L(t^{p+1})}{(p+1)!}. \tag{9.52}$$


The expressions in (9.51) and (9.51') are referred to as remainder terms of the functional $L$.

In analogy with (9.6) we define for the method (9.49) the local truncation error at the point $x$ to be the vector


$$r(x,h) = y^*(x,h) - y(x + h),$$

where $y^*$ is the solution of

$$y^* + \sum_{s=1}^{k} \alpha_s y(x - sh + h) - h\Bigl[\beta_0 f(x + h, y^*) + \sum_{s=1}^{k} \beta_s y'(x - sh + h)\Bigr] = 0.$$


If the functional $L$ in (9.50) has order $p$, then our method (9.49) is of order $p$ in the sense of (9.7). More precisely, it can be shown that if $f \in C^p$, then


$$\|r(x,h)\| = |l_{p+1}|\,\|y^{(p+1)}(x)\|\,h^{p+1} + o(h^{p+1}).$$


Multistep methods of a more general form than (9.49) have also been considered. Some of them make use of certain "advanced" points $(x_{n+\nu}, y_{n+\nu})$, $\nu > 1$, others of partial derivatives of $f$. For a study of these we refer to [9, 15, 33, 44].
9.7 Multistep Methods of Maximal Order


We turn now to the question of determining suitable functionals (9.50). In order to obtain a functional $L$ of order $p$, we must have


$$L(t^r) = \sum_{s=0}^{k} \alpha_s(k - s)^r - r\sum_{s=0}^{k} \beta_s(k - s)^{r-1} = 0 \qquad (r = 0, 1, \dots, p). \tag{9.53}$$

Since $\alpha_0 = 1$, these relations represent an inhomogeneous system of $(p + 1)$ linear equations in $(2k + 1)$ unknowns $\alpha_s$, $\beta_s$. If $p < 2k$, such a system has always infinitely many solutions, so that there is an infinite number of functionals $L$ with index $k$ and order $p < 2k$. On the other hand, it can be shown that (9.53) has a unique solution if $p = 2k$ and no solution if $p > 2k$. The corresponding functional $L$ of maximum order can be constructed in closed form by using Hermite's interpolation formula (details are given in [14], numerical values in [56]).

The resulting $k$-step methods of order $2k$ are mainly of theoretic interest, in spite of their high local accuracy. It turns out that the corresponding recurrence (9.49) is very sensitive to small disturbances, which are quickly amplified during repeated application of (9.49). This phenomenon, known as numerical instability, is examined in more detail in Sec. 9.10. Here, we mention only that the stability properties of (9.49) depend on the location of the zeros of the polynomial


$$\alpha(z) = z^k + \alpha_1z^{k-1} + \cdots + \alpha_k, \tag{9.54}$$


which is called the generating polynomial of the functional $L$. In fact, a necessary condition for stability is that all zeros of $\alpha(z)$ be located within or on the unit circle and that all zeros on the unit circle be simple [14, 59]:


$$\alpha(\zeta) = 0 \quad \text{implies} \quad \text{either } |\zeta| < 1 \text{ or } |\zeta| = 1,\; \alpha'(\zeta) \ne 0. \tag{9.55}$$

Following Dahlquist [14], we call a functional $L$ (and also the corresponding multistep method) stable if the generating polynomial $\alpha(z)$ of $L$ satisfies condition (9.55).

From the practical point of view, therefore, the interesting question is to maximize the order of a functional $L$ among all stable functionals with given index. We can assume, henceforth, that $L$ has order $p \ge 0$, so that certainly

$$\alpha(1) = 1 + \alpha_1 + \alpha_2 + \cdots + \alpha_k = 0. \tag{9.56}$$


We note then, first of all, the following theorem.
Theorem 9.1. If for a given generating polynomial (9.54) there exists a corresponding functional $L$ with index $k$ and order $p \ge k + 1$, then it is unique.
The proof follows readily from (9.53), in which the $\alpha_s$ are to be considered as prescribed and the $\beta_s$ as unknowns.
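
The construction behind Theorem 9.1 can be carried out numerically: with the $\alpha_s$ prescribed (and $\alpha(1) = 0$), the $k + 1$ relations (9.53) for $r = 1, \dots, k + 1$ form a linear system for $\beta_0, \dots, \beta_k$. The following sketch, with hypothetical helper names, solves it by Gauss-Jordan elimination over the rationals; it is meant as an illustration, not as the constructive method of the references.

```python
from fractions import Fraction as F

def solve_linear(A, b):
    """Gauss-Jordan elimination over the rationals (A square and nonsingular)."""
    n = len(A)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for i in range(n):
            if i != col and M[i][col] != 0:
                factor = M[i][col]
                M[i] = [vi - factor * vc for vi, vc in zip(M[i], M[col])]
    return [row[n] for row in M]

def betas_for(alpha):
    """Solve (9.53), r = 1, ..., k+1, for beta_0..beta_k with alpha_0..alpha_k given.

    Assumes alpha[0] == 1 and sum(alpha) == 0, i.e. (9.56).
    """
    k = len(alpha) - 1
    A = [[r * (k - s) ** (r - 1) for s in range(k + 1)] for r in range(1, k + 2)]
    b = [sum(F(a) * (k - s) ** r for s, a in enumerate(alpha))
         for r in range(1, k + 2)]
    return solve_linear(A, b)

if __name__ == "__main__":
    print(betas_for([1, -1]))            # k = 1, alpha(z) = z - 1
    print(betas_for([1, 0, -1]))         # k = 2, alpha(z) = z^2 - 1
    print(betas_for([1, 1, 0, -1, -1]))  # k = 4, alpha(z) = z^4 + z^3 - z - 1
```

For $\alpha(z) = z - 1$ this yields $\beta = (\tfrac12, \tfrac12)$ (the trapezoidal rule), and for $\alpha(z) = z^2 - 1$ it yields $\beta = (\tfrac13, \tfrac43, \tfrac13)$ (Simpson's formula).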

A functional $L$ of the type indicated in Theorem 9.1 always exists. Indeed, let us write (9.50) equivalently in the form


$$Lw(t) = \sum_{s=0}^{k} [\alpha_s w(k - s) - \gamma_{ks}\nabla^s w'(k)] = \alpha(E)E^{-k}w(k) - \sum_{s=0}^{k} \gamma_{ks}\nabla^s w'(k), \tag{9.57}$$

where $E$, $\nabla$ denote the displacement and the backward difference operator, respectively. By means of the formal calculus of operators, we transform (9.57) into


$$Lw(t) = \left\{\frac{\alpha[1/(1 - \nabla)](1 - \nabla)^k}{-\ln(1 - \nabla)} - \sum_{s=0}^{k} \gamma_{ks}\nabla^s\right\} w'(k),$$


which certainly holds whenever $w(t)$ is a polynomial. Let


$$\frac{\alpha[1/(1 - z)](1 - z)^k}{-\ln(1 - z)} = \sum_{s=0}^{\infty} c_{ks}z^s \tag{9.58}$$

and define

$$\gamma_{ks} = c_{ks} \qquad (s = 0, 1, \dots, k). \tag{9.59}$$

Then

$$Lw(t) = \Bigl(\sum_{s=k+1}^{\infty} c_{ks}\nabla^s\Bigr) w'(k). \tag{9.60}$$


Hence, $Lw = 0$ if $w$ is a polynomial of degree $\le k + 1$; that is, $L$ is of order $p \ge k + 1$.
If $L$ is stable, it can be shown, moreover, that $c_{k,k+1} \ne 0$ unless $k$ is even and

$$\alpha_{k-s} = -\alpha_s \qquad (s = 0, 1, \dots, k;\; k \text{ even}), \tag{9.61}$$


in which case $c_{k,k+1} = 0$, $c_{k,k+2} \ne 0$. In view of (9.60) and Theorem 9.1, this implies the following result.

Theorem 9.2 (Dahlquist [14]). The maximal order among all stable functionals $L$ with index $k$ is equal to $k + 1$ if $k$ is odd and is equal to $k + 2$ if $k$ is even.

It is characteristic of a stable functional of even index and maximal order that its generating polynomial $\alpha(z)$ has the zeros $z = 1$ and $z = -1$ and that all other zeros, if any, are on the unit circle arranged in conjugate pairs. In fact, (9.61) is equivalent to $z^k\alpha(1/z) + \alpha(z) = 0$; hence $\alpha(1) = \alpha(-1) = 0$, and $\alpha(\zeta) = 0$ implies $\alpha(1/\zeta) = 0$. Since no zero is allowed to fall outside the unit circle, we have $|\zeta| = 1$ for each zero $\zeta$, and then $1/\zeta = \bar\zeta$ is a zero whenever $\zeta$ is.
Conversely, if $\alpha(z)$ has the zeros


$$\zeta_1 = 1, \qquad \zeta_2 = -1, \qquad \zeta_{2r-1} = \bar\zeta_{2r} = e^{i\theta_r} \qquad (r = 2, 3, \dots, k/2;\; 0 < \theta_r < \pi),$$


then

$$\alpha(z) = (z^2 - 1)\prod_{r=2}^{k/2} [z^2 - (2\cos\theta_r)z + 1],$$

so that

$$z^k\alpha(1/z) + \alpha(z) = (1 - z^2)\prod_{r=2}^{k/2} [1 - (2\cos\theta_r)z + z^2] + (z^2 - 1)\prod_{r=2}^{k/2} [z^2 - (2\cos\theta_r)z + 1] = 0.$$


The gain in order, if $k$ is even, can be explained by the existence of an expansion of $L$ in central differences,


$$Lw = \sum_{s=0}^{k} \alpha_s w(k - s) - \sum_{s=0}^{k/2} \tau_{ks}\,\delta^{2s} w'(\tfrac12 k) \qquad (k \text{ even};\; \alpha_{k-s} = -\alpha_s), \tag{9.62}$$


where the coefficients $\tau_{ks}$ are found from*


$$\frac{\alpha(u)}{u^{k/2}\ln u} = \sum_{s=0}^{\infty} \tau_{ks}z^{2s}, \qquad u = 1 + \tfrac12 z^2 + z\sqrt{1 + \tfrac14 z^2}. \tag{9.63}$$


Formula (9.62), for example, contains for $k = 2$, $\alpha(z) = z^2 - 1$, Simpson's formula

$$w(2) - w(0) - \tfrac13[w'(2) + 4w'(1) + w'(0)] = -\tfrac1{90}w^{(5)}(\tau), \tag{9.64}$$

and for $k = 4$, $\alpha(z) = z^4 + z^3 - z - 1$ a formula due to Dahlquist [14]:


$$w(4) + w(3) - w(1) - w(0) - 3[w'(3) + w'(1) + \tfrac1{10}\delta^4w'(2)] = -\tfrac1{140}w^{(7)}(\tau). \tag{9.65}$$


The remainder terms are obtained from (9.51'), which is applicable in these cases, as can be verified.

Stable functionals $L$ of maximal order are always closed [14]. Open functionals of index $k$ and order $k$ can be constructed by a formula analogous to (9.57):


$$Lw = \sum_{s=0}^{k} \alpha_s w(k - s) - \sum_{s=0}^{k-1} \sigma_{ks}\nabla^s w'(k - 1), \tag{9.66}$$

where

$$\frac{\alpha[1/(1 - z)](1 - z)^{k-1}}{-\ln(1 - z)} = \sum_{s=0}^{\infty} \sigma_{ks}z^s. \tag{9.67}$$


* Note that the left-hand side in (9.63) is an even function of $z$, since $u(-z) = 1/u(z)$ and

$$\frac{\alpha(1/u)}{u^{-k/2}\ln(1/u)} = \frac{-u^k\alpha(1/u)}{u^{k/2}\ln u} = \frac{\alpha(u)}{u^{k/2}\ln u}$$

by virtue of (9.61).
For example, if $k = 4$ and $\alpha(z) = z^4 - 1$, we obtain from (9.66) and (9.51'),


$$w(4) - w(0) - \tfrac43[2w'(3) - w'(2) + 2w'(1)] = \tfrac{14}{45}w^{(5)}(\tau). \tag{9.68}$$


In view of the representations (9.57) and (9.66), multistep methods are often called finite-difference methods. Extensive lists of various examples of difference methods can be found in [22, 48, 65, 69]. The classical difference methods are based on the functional $L$ with $\alpha(z) = z^k - z^{k-1}$ and are discussed in more detail in Sec. 9.9.


9.8 Predictor-Corrector Schemes 


We return now to the question of solving the equation

$$y_{n+1} + \alpha_1 y_n + \cdots + \alpha_k y_{n+1-k} = h[\beta_0 f(x_{n+1}, y_{n+1}) + \beta_1 f_n + \cdots + \beta_k f_{n+1-k}] \qquad (\beta_0 \ne 0) \tag{9.69}$$

for the vector $y_{n+1}$, assuming that the vectors $y_n, y_{n-1}, \ldots$; $f_n, f_{n-1}, \ldots$ are known. We shall show that, for $h$ sufficiently small, (9.69) has a unique solution, which can be obtained by a method of successive approximations.

Theorem 9.3. If

$$|\beta_0| K h < 1, \tag{9.70}$$

$K > 0$ being a constant satisfying (9.2), then Eq. (9.69) has a unique solution $y_{n+1}$. If we define the iteration

$$y^{[\nu+1]}_{n+1} + \alpha_1 y_n + \cdots + \alpha_k y_{n+1-k} = h[\beta_0 f(x_{n+1}, y^{[\nu]}_{n+1}) + \beta_1 f_n + \cdots + \beta_k f_{n+1-k}] \qquad (\nu = 0, 1, 2, \ldots), \tag{9.71}$$

then $y^{[\nu]}_{n+1} \to y_{n+1}$ as $\nu \to \infty$ for any initial vector $y^{[0]}_{n+1}$. Moreover,

$$\|y^{[\nu]}_{n+1} - y_{n+1}\| \le \frac{(|\beta_0| K h)^{\nu}}{1 - |\beta_0| K h}\, \|y^{[1]}_{n+1} - y^{[0]}_{n+1}\|. \tag{9.72}$$

(It is tacitly assumed that $y^{[0]}_{n+1} \in R$ and that, with any $y^{[\nu]}_{n+1} \in R$, also $y^{[\nu+1]}_{n+1} \in R$, where $R$ is a closed set in euclidean space $E_m$.)
Proof. Define the operator

$$Tu = h\beta_0 f(x_{n+1}, u) - \sum_{s=1}^{k} (\alpha_s y_{n+1-s} - h\beta_s f_{n+1-s}),$$

which maps $R$ into itself, and let $\mu = |\beta_0| K h$; by assumption, $\mu < 1$. Then, for any two vectors $u$ and $v$ of $R$, we have, by (9.2),

$$\|Tu - Tv\| = h|\beta_0|\, \|f(x_{n+1}, u) - f(x_{n+1}, v)\| \le \mu \|u - v\|,$$

so that $T$ is a contraction operator. For such operators it is well known that the iteration $y^{[\nu+1]}_{n+1} = T y^{[\nu]}_{n+1}$, that is, (9.71), converges to a unique solution $y_{n+1}$ of (9.69) for which (9.72) holds.
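
The successive approximations of Theorem 9.3 and the bound (9.72) can be illustrated on a scalar example. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the corrector is the trapezoidal rule ($\beta_0 = \tfrac12$), and the test equation $y' = -5y$ (so that $K = 5$) is chosen only because the implicit equation then has a closed-form solution to compare against.

```python
def corrector_iterates(f, x_next, known, beta0, h, y_start, count):
    """Return [y^[0], ..., y^[count]] for y^[v+1] = h*beta0*f(x_next, y^[v]) + known."""
    iterates = [y_start]
    for _ in range(count):
        iterates.append(h * beta0 * f(x_next, iterates[-1]) + known)
    return iterates

# Assumed example: y' = -5y, one trapezoidal step of length h from y_n = 1.
f = lambda x, y: -5.0 * y
h, beta0, K = 0.1, 0.5, 5.0
y_n = 1.0
known = y_n + h * 0.5 * f(0.0, y_n)   # all terms of (9.69) not involving y_{n+1}
mu = abs(beta0) * K * h               # contraction constant of (9.70), here 0.25

iterates = corrector_iterates(f, h, known, beta0, h, y_n, 10)
exact = known / (1.0 + 0.5 * h * 5.0)  # solution of the (linear) implicit equation
# Right-hand side of (9.72) for v = 0, 1, ..., 10.
bounds = [mu ** v / (1.0 - mu) * abs(iterates[1] - iterates[0]) for v in range(11)]
```

Each error `abs(iterates[v] - exact)` stays below `bounds[v]`, and both shrink by the factor $\mu$ per iteration, which is why two or three corrections usually suffice when $Kh$ is small.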

Practical use of Theorem 9.3 is made in the predictor-corrector methods. These consist in "predicting" an initial approximation $y^{[0]}_{n+1}$ to $y_{n+1}$ by means of an open multistep formula and "correcting" it then successively by means of the iteration (9.71), based on a closed multistep formula. If $L_p$ and $L_c$ denote the functionals corresponding to the predictor and corrector formulas, respectively, then they are usually chosen in such a way that $L_p$ and $L_c$ have the same order $p$ but that $L_c$ has a smaller remainder term than $L_p$; that is,

$$\text{order } L_p = \text{order } L_c = p, \qquad |L_p(t^{p+1})| > |L_c(t^{p+1})|. \tag{9.73}$$

As an example of such a predictor-corrector scheme, we mention Milne's method [40], which uses the functional in (9.68) for $L_p$ and the functional in (9.64) for $L_c$:

$$\begin{aligned} y^{[0]}_{n+1} &= y_{n-3} + \tfrac43 h(2f_n - f_{n-1} + 2f_{n-2}), \\ y^{[\nu+1]}_{n+1} &= y_{n-1} + \tfrac13 h[f(x_{n+1}, y^{[\nu]}_{n+1}) + 4f_n + f_{n-1}] \qquad (\nu = 0, 1, 2, \ldots). \end{aligned} \tag{9.74}$$

Here, $p = 4$ and $L_p[t^5] = \tfrac{112}{3}$, $L_c[t^5] = -\tfrac43$. Other pairs of predictor and corrector formulas are given in the next section. See also [24, 57], for alternative formulas, and [30], for a procedure of changing the step length in Milne's method.
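
The scheme (9.74) can be sketched for a scalar equation as follows. This is an illustration under stated assumptions, not a production integrator: the function name and argument layout are hypothetical, the four starting values are supplied by the caller (here taken from the exact solution), and a fixed number of corrections is applied instead of a convergence test.

```python
import math

def milne(f, x0, start_values, h, steps, corrections=2):
    """Advance y' = f(x, y) by Milne's predictor-corrector pair (9.74).

    start_values holds y_0, y_1, y_2, y_3; the returned list holds
    y_0, ..., y_{3+steps}.
    """
    ys = list(start_values)
    fs = [f(x0 + i * h, y) for i, y in enumerate(ys)]
    for n in range(3, 3 + steps):
        x_next = x0 + (n + 1) * h
        # Predictor: open formula built on the functional (9.68).
        y_next = ys[n - 3] + 4.0 * h / 3.0 * (2.0 * fs[n] - fs[n - 1] + 2.0 * fs[n - 2])
        # Corrector: Simpson's formula (9.64), iterated as in (9.71).
        for _ in range(corrections):
            y_next = ys[n - 1] + h / 3.0 * (f(x_next, y_next) + 4.0 * fs[n] + fs[n - 1])
        ys.append(y_next)
        fs.append(f(x_next, y_next))
    return ys

# Assumed test problem: y' = y, y(0) = 1, integrated to x = 1 with h = 0.1.
h = 0.1
start = [math.exp(i * h) for i in range(4)]
ys = milne(lambda x, y: y, 0.0, start, h, steps=7)
error_at_one = abs(ys[10] - math.e)
```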

The number of iterations required to obtain $y_{n+1}$ with a given accuracy is seen from (9.72) to depend, in part, on the size of $Kh$. Since this quantity should be kept considerably less than unity, not more than two or three iterations should turn out to be necessary. If more are needed, there is evidence that the step length $h$ is too large.

We note also that the relations

$$y^{[\nu+1]}_{n+1} = y^{[\nu]}_{n+1} + h\beta_0 [f(x_{n+1}, y^{[\nu]}_{n+1}) - f(x_{n+1}, y^{[\nu-1]}_{n+1})] \qquad (\nu \ge 1) \tag{9.75}$$

can be substituted for (9.71) to compute the second iterate and all higher iterates.

The predictor-corrector scheme described above for systems of first-order differential equations applies also to a single differential equation of order $m > 1$,

$$v^{(m)} = g(x, v, v', \ldots, v^{(m-1)}), \tag{9.76}$$

if (9.76) is transformed in the usual manner into a system of $m$ first-order equations. One would predict then the unknown function $v$ and all its first $(m - 1)$ derivatives, before any corrections are applied. In practice, however, one predicts only once, namely, the value $v^{(m-1)}_{n+1}$, by a predictor formula $P$, say, and from then on uses a corrector formula $C$ to obtain $v^{(m-2)}_{n+1}, v^{(m-3)}_{n+1}, \ldots, v_{n+1}$, in this order. The values obtained are then inserted into the differential equation (9.76), giving $v^{(m)}_{n+1}$, after which $v^{(m-1)}_{n+1}$ can be recalculated by the corrector formula $C$. If this new value differs from that obtained previously by $P$, the cycle is repeated.


9.9 Adams’ Method 


Difference methods which correspond to the generating polynomial

$$a(z) = z^k - z^{k-1} \qquad (k > 0)$$

are always stable in the sense of Sec. 9.7 since $a(z)$ has the simple zero $z = 1$ and all remaining zeros are zero if $k > 1$. The corresponding open and closed methods of index $k$ and maximal order are called Adams' extrapolation method and Adams' interpolation method, respectively. The latter is of order $k + 1$, by virtue of Theorem 9.2, whereas the former is of order $k$, as follows from (9.66) and Theorem 9.1.

The functional $L_e$ for the extrapolation method is obtained from (9.66) and (9.67),

$$L_e w = w(k) - w(k-1) - \sum_{s=0}^{k-1} \sigma_s \nabla^s w'(k-1), \tag{9.77}$$

where the constants $\sigma_s$ are determined by the expansion

$$-\frac{z}{(1-z)\ln(1-z)} = \sum_{s=0}^{\infty} \sigma_s z^s. \tag{9.78}$$

Here, the coefficients no longer depend on $k$, which has the practical advantage that the index and order of $L_e$ can be simultaneously increased by simply adding more terms to the sum in (9.77).

Adams' extrapolation method of order $k$ can thus be written in the form

$$y_{n+1} = y_n + h \sum_{s=0}^{k-1} \sigma_s \nabla^s f_n. \tag{9.79}$$

The coefficients $\sigma_s$ allow the following representation as a definite integral:

$$\sigma_s = \int_0^1 \binom{t+s-1}{s}\, dt \qquad (s = 0, 1, 2, \ldots). \tag{9.80}$$


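In fact, since $L_e$ is of order $k$, it annihilates every polynomial of degree at most $k$. Taking $w$ such that $w'(t) = \binom{t}{k-1}$, we have, for $k \ge 1$,

$$0 = L_e w = \int_{k-1}^{k} \binom{t}{k-1}\, dt - \sum_{s=0}^{k-1} \sigma_s \left[\nabla^s \binom{t}{k-1}\right]_{t=k-1} = \int_0^1 \binom{t+k-1}{k-1}\, dt - \sum_{s=0}^{k-1} \sigma_s,$$

where the well-known relation

$$\nabla^s \binom{t}{j} = \binom{t-s}{j-s}$$

was used. Hence

$$\sum_{s=0}^{k-1} \sigma_s = \int_0^1 \binom{t+k-1}{k-1}\, dt,$$

and since $k$ is arbitrary,

$$\sigma_k = \sum_{s=0}^{k} \sigma_s - \sum_{s=0}^{k-1} \sigma_s = \int_0^1 \left[\binom{t+k}{k} - \binom{t+k-1}{k-1}\right] dt = \int_0^1 \binom{t+k-1}{k}\, dt,$$

which proves (9.80) for $s \ge 1$. If $s = 0$, then $\sigma_0 = 1$ follows directly from (9.78).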
As to the remainder term of $L_e$, it can be shown that

$$L_e w = \sigma_k w^{(k+1)}(\tau), \qquad 0 < \tau < k. \tag{9.81}$$

For the interpolation method we obtain the corresponding functional $L_i$ from (9.57) to (9.59),

$$L_i w = w(k) - w(k-1) - \sum_{s=0}^{k} \gamma_s \nabla^s w'(k), \tag{9.82}$$

where

$$-\frac{z}{\ln(1-z)} = \sum_{s=0}^{\infty} \gamma_s z^s. \tag{9.83}$$

By an argument similar to that following (9.80) one finds

$$\gamma_s = \int_0^1 \binom{t+s-2}{s}\, dt \qquad (s = 0, 1, 2, \ldots). \tag{9.84}$$

Also, the remainder term of $L_i$ can be written in the form

$$L_i w = \gamma_{k+1} w^{(k+2)}(\tau), \qquad 0 < \tau < k. \tag{9.85}$$
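
The coefficients of both Adams methods are easy to generate from the expansions (9.78) and (9.83). The sketch below is an illustration with hypothetical function names. It assumes the series $-\ln(1-z)/z = \sum_{j \ge 0} z^j/(j+1)$, so that the $\gamma_s$ follow from a convolution recurrence and $\sigma_s = \gamma_0 + \cdots + \gamma_s$ (division of the series by $1 - z$). It then compares the results with the integral representations (9.80) and (9.84), evaluated exactly.

```python
from fractions import Fraction as Fr
from math import factorial

def adams_coefficients(count):
    """Return (sigma, gamma), each a list of the first `count` coefficients."""
    gamma = [Fr(1)]
    for m in range(1, count):
        # Coefficient of z^m in (sum gamma_s z^s) * (sum z^j/(j+1)) = 1.
        gamma.append(-sum(gamma[m - j] / (j + 1) for j in range(1, m + 1)))
    sigma, running = [], Fr(0)
    for g in gamma:
        running += g          # partial sums of gamma give sigma
        sigma.append(running)
    return sigma, gamma

def integral_of_binomial(shift, s):
    """Exact integral over [0, 1] of the binomial coefficient C(t + shift, s)."""
    poly = [Fr(1)]            # coefficients in ascending powers of t
    for i in range(s):
        c = shift - i         # multiply the polynomial by (t + shift - i)
        new = [Fr(0)] * (len(poly) + 1)
        for p, a in enumerate(poly):
            new[p] += a * c
            new[p + 1] += a
        poly = new
    return sum(a / (p + 1) for p, a in enumerate(poly)) / factorial(s)

sigma, gamma = adams_coefficients(6)
```

The first values are $\sigma_s = 1, \tfrac12, \tfrac{5}{12}, \tfrac38, \tfrac{251}{720}$ and $\gamma_s = 1, -\tfrac12, -\tfrac1{12}, -\tfrac1{24}, -\tfrac{19}{720}$.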


If we increase by 1 the index (and order) of the functional $L_e$ in (9.77), we can combine it with $L_i$ to form a pair of predictor and corrector formulas (both having order $k + 1$). This leads to the following method:

$$\begin{aligned} y^{[0]}_{n+1} &= y_n + h \sum_{s=0}^{k} \sigma_s \nabla^s f_n, \\ y^{[\nu+1]}_{n+1} &= y_n + h \sum_{s=0}^{k} \gamma_s \nabla^s f^{[\nu]}_{n+1} \qquad (\nu = 0, 1, 2, \ldots), \end{aligned} \tag{9.86}$$

where $f^{[\nu]}_{n+1} = f(x_{n+1}, y^{[\nu]}_{n+1})$. It is easily verified from (9.80) and (9.84) that

$$|\gamma_{s+1}| < \sigma_{s+1} \qquad (s \ge 1),$$

so that the criterion (9.73) for a proper choice of predictor and corrector formulas is fulfilled.

Adams [3] derives both formulas (9.86) by integration of appropriate interpolation polynomials. He does not use the first formula but predicts $y^{[0]}_{n+1}$ by extrapolation of the last difference retained and successive "advancing" of the difference table.

As indicated in (9.75), the calculation for the iteration in (9.86) can be shortened by using

$$y^{[\nu+1]}_{n+1} = y^{[\nu]}_{n+1} + h\sigma_k (f^{[\nu]}_{n+1} - f^{[\nu-1]}_{n+1}) \qquad \text{for } \nu \ge 1,$$

since the coefficient $\beta_0$ of the corrector in (9.86) is $\gamma_0 + \gamma_1 + \cdots + \gamma_k = \sigma_k$. In the case where $y^{[0]}_{n+1}$ is predicted according to (9.86), also the first iterate can be obtained directly by means of the formula [58]:

$$y^{[1]}_{n+1} = y^{[0]}_{n+1} + h\sigma_k \nabla^{k+1} f^{[0]}_{n+1}. \tag{9.87}$$

In order to use (9.87), however, one has to carry along an extra column of the $(k + 1)$st differences.
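
The Adams predictor-corrector pair (9.86) can be sketched in difference form as follows. This is a minimal illustration under stated assumptions: all function names are hypothetical, the coefficients are generated from (9.78) and (9.83) as power-series coefficients, the $k + 1$ starting values are supplied by the caller, and a fixed number of corrections is applied. The last lines compare the first correction with the shortcut $h\sigma_k \nabla^{k+1} f^{[0]}_{n+1}$.

```python
import math
from fractions import Fraction as Fr

def adams_coefficients(count):
    """sigma_s (extrapolation) and gamma_s (interpolation), returned as floats."""
    gamma = [Fr(1)]
    for m in range(1, count):
        gamma.append(-sum(gamma[m - j] / (j + 1) for j in range(1, m + 1)))
    sigma, running = [], Fr(0)
    for g in gamma:
        running += g
        sigma.append(running)
    return [float(s) for s in sigma], [float(g) for g in gamma]

def backward_differences(values, order):
    """Return the backward differences of orders 0..order at the last entry."""
    diffs, column = [], list(values)
    for _ in range(order + 1):
        diffs.append(column[-1])
        column = [b - a for a, b in zip(column, column[1:])]
    return diffs

def adams_step(f, x_next, h, y_n, f_history, k, sigma, gamma, corrections=1):
    """One step of (9.86); f_history holds f_{n-k}, ..., f_n. Returns (predicted, corrected)."""
    d_pred = backward_differences(f_history, k)
    predicted = y_n + h * sum(sigma[s] * d_pred[s] for s in range(k + 1))
    y_next = predicted
    for _ in range(corrections):
        d_corr = backward_differences(f_history[1:] + [f(x_next, y_next)], k)
        y_next = y_n + h * sum(gamma[s] * d_corr[s] for s in range(k + 1))
    return predicted, y_next

def adams_solve(f, x0, start_values, h, steps, k, corrections=1):
    """Integrate from k+1 starting values y_0..y_k; returns all computed y_n."""
    sigma, gamma = adams_coefficients(k + 1)
    ys = list(start_values)
    fs = [f(x0 + i * h, y) for i, y in enumerate(ys)]
    for _ in range(steps):
        x_next = x0 + len(ys) * h
        _, y_next = adams_step(f, x_next, h, ys[-1], fs[-(k + 1):], k, sigma, gamma, corrections)
        ys.append(y_next)
        fs.append(f(x_next, y_next))
    return ys

# Assumed test problem: y' = -y, y(0) = 1, k = 3, h = 0.1, integrated to x = 2.
f = lambda x, y: -y
h, k = 0.1, 3
start = [math.exp(-i * h) for i in range(k + 1)]
ys = adams_solve(f, 0.0, start, h, steps=17, k=k)

# Comparison of the first correction with h * sigma_k * (k+1)st difference.
sigma, gamma = adams_coefficients(k + 2)
f_hist = [f(i * h, y) for i, y in enumerate(start)]
pred, corr = adams_step(f, (k + 1) * h, h, start[-1], f_hist, k, sigma, gamma)
shortcut = h * sigma[k] * backward_differences(f_hist + [f((k + 1) * h, pred)], k + 1)[k + 1]
```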

A detailed error analysis for Adams’ method was first given by von 
Mises [41] and since then has been the subject of numerous investigations 
(e.g., [2, 37, 39, 43, 61 to 64, 67, 68]). We refer the reader to these 
original papers or to the treatments in the texts by Hildebrand and 
Collatz. 


9.10 Stability of Difference Methods 


In numerical work it is usually impossible to carry out all required calculations with unlimited precision. One is forced to round numbers to finitely many figures and, if infinite processes are involved, to reduce these to finite ones. Thus, in the case of multistep method (9.49), the actual results $z_n$ one obtains do not satisfy (9.49), that is,

$$\sum_{s=0}^{k} \alpha_s y_{n+1-s} = h \sum_{s=0}^{k} \beta_s f(x_{n+1-s}, y_{n+1-s}), \tag{9.88}$$

but rather

$$\sum_{s=0}^{k} \alpha_s z_{n+1-s} = h \sum_{s=0}^{k} \beta_s f(x_{n+1-s}, z_{n+1-s}) + r_n, \tag{9.89}$$

where the vectors $r_n$ are "small" in norm. If the initial vectors $z_\kappa$ ($\kappa = 0, 1, \ldots, k - 1$) differ from $y_\kappa$ only slightly and if all $r_n$ are sufficiently small, one expects that the vectors $z_n$ will also differ only slightly from $y_n$ for all $n$. Regarding (9.89) as a perturbation of (9.88), one may then say, loosely speaking, that (9.88) is stable if it is indeed insensitive to small perturbances $r_n$.
In the following we consider a somewhat stronger notion of stability, which is uniform with respect to $h$ in the sense that, if $z_\kappa - y_\kappa$ and $r_n$ are suitably restricted, stability takes place no matter how small $h$. Since (9.88) will generally be used for $n = k - 1, k, \ldots, N - 1$, with $N$ such that $x_0 + Nh \le X$, this implies that we shall have stability for arbitrarily large $N$.

We say that the perturbations in (9.89) are of class $C(\delta)$ if, for some $h_0 > 0$,

$$\sum_{\kappa=0}^{k-1} \|z_\kappa - y_\kappa\| + \sum_{n=k-1}^{N-1} \|r_n\| \le \delta \tag{9.90}$$

uniformly for all $h \le h_0$ and all $N$ with $x_0 + Nh \le X$. For example, perturbations such that $\|z_\kappa - y_\kappa\| \le \delta/2k$, $\|r_n\| \le \delta h/[2(X - x_0)]$ are of class $C(\delta)$.

The multistep method (9.88) is then called stable with respect to a class $\Gamma$ of functions $f(x, y)$ if for each $f \in \Gamma$ and for any $\epsilon > 0$ there exists a $\delta(\epsilon) > 0$ such that the errors produced by perturbations of class $C(\delta)$ satisfy $\|z_n - y_n\| \le \epsilon$ ($n = k, k + 1, \ldots, N$) uniformly for all $h \le h_0$ and all $N$ with $x_0 + Nh \le X$.†

In the following we consider the class $\Gamma^*$ of functions $f$ continuous on $I \times R$ and satisfying (9.2).

Theorem 9.4.‡ A multistep method (9.88) is stable relative to the class $\Gamma^*$ if and only if condition (9.55) is fulfilled.

Before proving Theorem 9.4, we establish the following useful lemma 
on linear nonhomogeneous difference equations with constant coeffi- 
cients. 

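Lemma 9.1. Let $e_n$ satisfy the difference equation

$$\sum_{s=0}^{k} \alpha_s e_{n+1-s} = g_n \qquad (n \ge k - 1;\ \alpha_0 = 1) \tag{9.91}$$

and denote by $H_{n\kappa}$ ($\kappa = 0, 1, \ldots, k - 1$) the solutions of the corresponding homogeneous equation (with $g_n = 0$) satisfying the initial conditions

$$H_{n\kappa} = \delta_{n\kappa} \qquad (n = 0, 1, \ldots, k - 1). \tag{9.92}$$

($\delta_{n\kappa}$ is the Kronecker symbol.) Then

$$e_n = \sum_{\kappa=0}^{k-1} H_{n\kappa} e_\kappa + \sum_{\nu=k-1}^{n-1} H_{n+k-\nu-2,\,k-1}\, g_\nu \qquad (n \ge k). \tag{9.93}$$

Proof of Lemma 9.1. It is clear from (9.91) that $e_n$ must be of the form

$$e_n = \sum_{\kappa=0}^{k-1} K_{n\kappa} e_\kappa + \sum_{\nu=k-1}^{n-1} L_{n\nu} g_\nu \qquad (n \ge k). \tag{9.94}$$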


† For other definitions of stability, see, for example, [34, 49].
‡ Theorem 9.4 is essentially due to Dahlquist [14]. We follow here, with minor deviations, the exposition given in [27].
By considering the special case where all $g_n = 0$ and $e_\kappa = \delta_{\kappa\lambda}$ for some fixed $\lambda$, $0 \le \lambda < k$, we infer from (9.94) that $e_n = K_{n\lambda}$, and from the uniqueness of the corresponding solution that

$$K_{n\lambda} = H_{n\lambda} \qquad (\lambda = 0, 1, \ldots, k - 1).$$

Considering next the case where $g_n = \delta_{n\nu}$ for some fixed $\nu \ge k - 1$ and $e_n = 0$ for $n = 0, 1, \ldots, k - 1$, we conclude from (9.94) that $L_{n\nu}$ is a solution of

$$\sum_{s=0}^{k} \alpha_s e_{n+1-s} = \delta_{n\nu} \qquad (n \ge k - 1) \tag{9.95}$$

satisfying the initial conditions

$$e_n = 0 \qquad (n = 0, 1, \ldots, k - 1). \tag{9.96}$$

For all $n < \nu$, (9.95) is homogeneous, so that $L_{n\nu} = 0$ ($n \le \nu$), because of (9.96). Setting $n = \nu$ in (9.95), we find $L_{\nu+1,\nu} = 1$. For $n > \nu$, (9.95) is again homogeneous. Hence, $L_{n\nu}$ is a solution, for $n > \nu$, of the homogeneous difference equation associated with (9.91), satisfying

$$L_{\nu-k+2,\nu} = L_{\nu-k+3,\nu} = \cdots = L_{\nu,\nu} = 0, \qquad L_{\nu+1,\nu} = 1.$$

Since $H_{n+k-\nu-2,\,k-1}$ is another such solution, it follows that

$$L_{n\nu} = H_{n+k-\nu-2,\,k-1}.$$

Lemma 9.1 is proved.
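
The representation (9.93) is easy to check numerically. The sketch below uses hypothetical function names and an arbitrarily chosen example with integer data, so that the comparison is exact. It solves (9.91) by direct recursion and compares the result with the right-hand side of (9.93), built from the fundamental solutions $H_{n\kappa}$.

```python
def solve_difference_equation(alpha, initial, g, n_max):
    """Solve sum_s alpha[s]*e_{n+1-s} = g(n) for n = k-1..n_max-1, with alpha[0] = 1.

    `initial` holds e_0..e_{k-1}; the returned list holds e_0..e_{n_max}.
    """
    k = len(alpha) - 1
    e = list(initial)
    for n in range(k - 1, n_max):
        e.append(g(n) - sum(alpha[s] * e[n + 1 - s] for s in range(1, k + 1)))
    return e

def lemma_formula(alpha, initial, g, n_max):
    """Evaluate the right-hand side of (9.93) for n = k..n_max."""
    k = len(alpha) - 1
    # H[kappa][n]: homogeneous solution with H[kappa][n] = delta_{n,kappa} for n < k.
    H = [
        solve_difference_equation(
            alpha, [1 if n == kappa else 0 for n in range(k)], lambda n: 0, n_max
        )
        for kappa in range(k)
    ]
    values = {}
    for n in range(k, n_max + 1):
        homogeneous = sum(H[kappa][n] * initial[kappa] for kappa in range(k))
        forced = sum(H[k - 1][n + k - nu - 2] * g(nu) for nu in range(k - 1, n))
        values[n] = homogeneous + forced
    return values

# Arbitrary illustrative data (not from the text).
alpha = [1, -3, 2]
initial = [2, -1]
g = lambda n: n * n + 1
direct = solve_difference_equation(alpha, initial, g, 12)
via_lemma = lemma_formula(alpha, initial, g, 12)
```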


Proof of Theorem 9.4. Let

$$e_n = z_n - y_n \qquad (n = 0, 1, \ldots, N).$$

Subtracting (9.88) from (9.89), we find

$$\sum_{s=0}^{k} \alpha_s e_{n+1-s} = g_n, \qquad g_n = h \sum_{s=0}^{k} \beta_s [f(x_{n+1-s}, z_{n+1-s}) - f(x_{n+1-s}, y_{n+1-s})] + r_n. \tag{9.97}$$

Thus, each component $e^\mu_n$ of $e_n$ satisfies a difference equation (9.91) with $g^\mu_n$ as inhomogeneous term.

To prove the necessity of (9.55), we need only consider the particular case where $f(x, y) \equiv 0$ and $r_n = 0$ ($n \ge k - 1$). Then $g_n = 0$, and $e^\mu_n$ is a solution of the homogeneous difference equation associated with (9.91). The general solution of this equation is a linear combination (with, in general, complex coefficients) of the $k$ solutions

$$n^j \zeta_i^n \qquad (j = 0, 1, \ldots, q_i - 1;\ i = 1, 2, \ldots, q), \tag{9.98}$$

where the $\zeta_i$ denote the distinct zeros of the generating polynomial $a(z)$ and the $q_i$ denote the respective multiplicities. Suppose (9.55) is not fulfilled. Then either $|\zeta_i| > 1$ for some $i$, or $|\zeta_i| = 1$ and $j > 0$ for some $i$ and $j$. In both cases the corresponding solution (9.98) is not bounded in absolute value as $n \to \infty$. It can always be made a component of $e_n$ by a suitable choice of the initial values $e_\kappa$. Since, furthermore, $e^\mu_n$ does not depend on $h$, we can choose $h$ so small and $n$ correspondingly so large as to make $|e^\mu_n|$, and thus $\|e_n\|$, as large as we please, no matter how small the initial errors are. This contradicts stability as defined above.
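
The mechanism of the necessity argument can be seen numerically by iterating the homogeneous error equation for the case $f \equiv 0$. The sketch below is an illustration only: the function name is hypothetical, the three sets of coefficients are chosen for this example, and it assumes that (9.55) is the condition described in the proof, namely that no zero of $a(z)$ lies outside the unit circle and that zeros on the unit circle are simple.

```python
def error_growth(alpha, initial_errors, steps):
    """Largest |e_n|, n = 0..steps, for sum_s alpha[s]*e_{n+1-s} = 0 with alpha[0] = 1."""
    k = len(alpha) - 1
    e = list(initial_errors)
    for n in range(k - 1, steps):
        e.append(-sum(alpha[s] * e[n + 1 - s] for s in range(1, k + 1)))
    return max(abs(v) for v in e)

tiny = 1e-12  # size of the initial perturbation

# a(z) = z^2 + 4z - 5 has the zero z = -5 outside the unit circle.
outside = error_growth([1, 4, -5], [0.0, tiny], 20)

# a(z) = (z - 1)^2 has a double zero on the unit circle: linear growth in n.
double = error_growth([1, -2, 1], [0.0, tiny], 1000)

# a(z) = z^2 - 1 has simple zeros on the unit circle: errors stay bounded.
simple = error_growth([1, 0, -1], [0.0, tiny], 1000)
```

In the first case a perturbation of size $10^{-12}$ exceeds unity after only 20 steps; in the second it grows in proportion to $n$, so that halving $h$ on a fixed interval doubles the final error; in the third it never exceeds its initial size.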

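To prove the sufficiency part of the theorem, we assume that (9.55) holds. Then all solutions (9.98) are bounded as $n \to \infty$. Since the solutions $H_{n\kappa}$ of Lemma 9.1 are linear combinations of the solutions (9.98), it follows that

$$|H_{n\kappa}| \le H \qquad (n \ge 0;\ \kappa = 0, 1, \ldots, k - 1)$$

for some $H \ge 1$. From (9.93), isolating the term with $\nu = n - 1$ in the second sum, we then have

$$|e^\mu_n| \le H \left( \sum_{\kappa=0}^{k-1} |e^\mu_\kappa| + |g^\mu_{n-1}| + \sum_{\nu=k-1}^{n-2} |g^\mu_\nu| \right).$$

Summing over $\mu$,

$$\sum_{\mu=1}^{m} |e^\mu_n| \le H \left( \sum_{\kappa=0}^{k-1} \sum_{\mu=1}^{m} |e^\mu_\kappa| + \sum_{\mu=1}^{m} |g^\mu_{n-1}| + \sum_{\nu=k-1}^{n-2} \sum_{\mu=1}^{m} |g^\mu_\nu| \right).$$

Since $\|u\| \le \sum_{\mu=1}^{m} |u^\mu| \le \sqrt{m}\, \|u\|$ for any vector $u \in E_m$, we find

$$\|e_n\| \le H\sqrt{m} \left( \sum_{\kappa=0}^{k-1} \|e_\kappa\| + \|g_{n-1}\| + \sum_{\nu=k-1}^{n-2} \|g_\nu\| \right).$$

From (9.97) and (9.2) we obtain the estimates

$$\|g_{n-1}\| \le h\beta K \|e_n\| + h\beta K \sum_{s=1}^{k} \|e_{n-s}\| + \|r_{n-1}\|,$$
$$\|g_\nu\| \le h\beta K \sum_{s=0}^{k} \|e_{\nu+1-s}\| + \|r_\nu\|,$$

where $\beta = \max_s |\beta_s|$, so that, after some grouping together, and further enlargement,

$$(1 - h\beta KH\sqrt{m})\, \|e_n\| \le h\beta KH(k + 1)\sqrt{m} \sum_{\nu=0}^{n-1} \|e_\nu\| + H\sqrt{m} \left( \sum_{\kappa=0}^{k-1} \|e_\kappa\| + \sum_{\nu=k-1}^{n-1} \|r_\nu\| \right).$$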
Assume now that $h \le h_0$, $1 - h_0\beta KH\sqrt{m} > 0$ and that the perturbations in (9.89) are of class $C(\delta)$. Then

$$\|e_n\| \le hA \sum_{\nu=0}^{n-1} \|e_\nu\| + B\delta \qquad (n = 1, 2, \ldots, N), \tag{9.99}$$

where

$$A = \frac{\beta KH(k + 1)\sqrt{m}}{1 - h_0\beta KH\sqrt{m}} > 0, \qquad B = \frac{H\sqrt{m}}{1 - h_0\beta KH\sqrt{m}} > 1.$$

Observe that the difference equation

$$E_n = hA \sum_{\nu=0}^{n-1} E_\nu + B\delta \qquad (n = 1, 2, \ldots), \qquad E_0 = B\delta \tag{9.100}$$

has the solution

$$E_n = B\delta(1 + hA)^n.$$

Obviously, $\|e_0\| \le E_0$, and subtracting (9.100) from (9.99), we get

$$\|e_n\| - E_n \le hA \sum_{\nu=0}^{n-1} (\|e_\nu\| - E_\nu) \qquad (n = 1, 2, \ldots).$$

Therefore, by induction, $\|e_n\| - E_n \le 0$. Thus, for $n \le N$,

$$\|e_n\| \le E_n \le E_N \le B\delta(1 + hA)^{(X - x_0)/h} < B\delta e^{A(X - x_0)}.$$

Choosing then $\delta = B^{-1} e^{-A(X - x_0)} \epsilon$, we obtain $\|e_n\| < \epsilon$, which proves stability. This completes the proof of Theorem 9.4.
As interesting as stability is the question of convergence, that is, the question of whether $\max_{0 \le n \le N} \|z_n - y(x_n)\| \to 0$ as $h \to 0$ whenever

$$\sum_{\kappa=0}^{k-1} \|z_\kappa - y(x_\kappa)\| + \sum_{n=k-1}^{N-1} \|r_n\| = O(h).$$

Convergence in this sense implies stability. Conversely, it can be shown [14, 27] that every stable method of order $\ge 1$ is also convergent. For further results along these lines, see [20].

A stable method, though producing satisfactory results for $h$ sufficiently small, may very well fail to do so if $h$ is not small enough. This is likely to occur if the solution of the differential equation decreases exponentially, whereas the approximating difference equation has solutions which increase exponentially. Instabilities of this weaker type are studied in [12, 45, 49, 54, 70, 71].
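
A standard illustration of this weaker kind of instability (chosen here as an example, not taken from the references above) is the midpoint rule $y_{n+1} = y_{n-1} + 2hf_n$, whose generating polynomial $z^2 - 1$ has simple zeros on the unit circle, applied to $y' = -y$. The difference equation then has a second, parasitic solution that alternates in sign and grows like $e^{x}$ while the true solution decays. The function name below is hypothetical.

```python
import math

def midpoint_rule(lam, h, steps):
    """Integrate y' = -lam*y, y(0) = 1, by y_{n+1} = y_{n-1} + 2h*f_n.

    The second starting value y_1 is taken from the exact solution.
    Returns y_0, ..., y_steps.
    """
    ys = [1.0, math.exp(-lam * h)]
    for n in range(1, steps):
        ys.append(ys[n - 1] + 2.0 * h * (-lam * ys[n]))
    return ys

# Short interval, small step: the method behaves well.
fine = midpoint_rule(1.0, 0.01, 100)      # x = 1
fine_error = abs(fine[-1] - math.exp(-1.0))

# Long interval, larger step: the growing parasitic solution swamps the result.
coarse = midpoint_rule(1.0, 0.1, 200)     # x = 20
coarse_value = abs(coarse[-1])
true_value = math.exp(-20.0)
```

With $h = 0.1$ the computed value at $x = 20$ is larger than 1 in magnitude, whereas the true solution there is about $2 \times 10^{-9}$.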
10.1 Introduction


Orthonormal sets of vectors or functions have for years played an 
important role in many theoretical discussions of algebra and analysis. 
They are of great use in matrix theory, approximation theory, differen- 
tial and integral equations, boundary-value problems of mathematical 
physics, etc. In short, there is hardly a region of linear analysis in 
which the employment of orthonormal systems does not lend great 
simplicity and elegance to the theory. Despite this fact, the use of such 
sets for the purposes of numerical analysis has been thus far quite 
limited. The reason for this is that the algebraic features of the ortho- 
normalizing process are somewhat involved when only hand computa- 
tion techniques are available. However, the current availability of 
high-speed computation machines with a reasonably large memory 
capacity has altered this situation substantially, and orthonormal 
systems should, in the near future, become part of the stock in trade 
of every numerical analyst. 

Orthonormalization codes can be written with sufficient generality 
and flexibility so as to be immediately utilizable in a wide variety of 
problems. With small changes in input and appropriate interpretation 
of the output, such a code can, in the hands of a competent numerical 
analyst, be made to tackle problems of seemingly diverse natures. It 
will have the precise advantages and disadvantages of any multiple-purpose tool. The purpose of this chapter is to discuss the manner in
which such an orthonormalization code can be set up, to outline the 
variety of problems that it can handle, and to describe a number of 
concrete problems that have been solved in this way. 
10.2 Orthonormal Vectors and Expansions


We deal with vectors $f$ of dimension $N$ possessing components $y_1, \ldots, y_N$. We employ this notation inasmuch as we are frequently concerned with the situation in which the components $y_i$ are the values of a function $f$ at a set of points $x_i$: $y_i = f(x_i)$. If a vector $f_1$ has components $y_{11}, \ldots, y_{1N}$ and a vector $f_2$ has components $y_{21}, \ldots, y_{2N}$, we introduce the expression

$$(f_1, f_2) = \sum_{i=1}^{N} w_i y_{1i} y_{2i} \tag{10.1}$$

as an inner product. The $w_i$ are a set of nonnegative weights which are considered fixed throughout this discussion. As the definition of the norm of a vector $f$, we take

$$\|f\| = (f, f)^{1/2}. \tag{10.2}$$

In the case of complex-valued components, we take as the definition of the inner product

$$(f_1, f_2) = \sum_{i=1}^{N} w_i y_{1i} \bar{y}_{2i}, \qquad \overline{x + iy} = x - iy. \tag{10.1'}$$

The most general inner product for vectors of dimension $N$ can be written in the form

$$(f_1, f_2) = (f_1)(W)(f_2)', \tag{10.1''}$$

where $(f_i)$ indicates the $(1 \times N)$-row vector $(y_{i1}, \ldots, y_{iN})$ and the prime denotes the transpose. Here $W = W'$ is a fixed positive definite matrix of order $N$. There have been a number of problems in statistics and eigenvalue theory in which it has been essential to use the inner product (10.1'').
A set of vectors $\phi_1, \phi_2, \ldots$ is said to be orthonormal with respect to an inner product if and only if

$$(\phi_j, \phi_k) = \delta_{jk} = \begin{cases} 0 & \text{if } j \ne k, \\ 1 & \text{if } j = k. \end{cases} \tag{10.3}$$

The components of an orthonormal set $\phi_1, \phi_2, \ldots$ are designated here by $z_{k1}, z_{k2}, \ldots, z_{kN}$ for $k = 1, 2, \ldots$. In terms of a set of orthonormal functions $\phi_1, \ldots, \phi_n$, $n \le N$, a given vector $f$ possesses an orthonormal (or Fourier) expansion

$$f \sim \sum_{k=1}^{n} (f, \phi_k)\phi_k \tag{10.4}$$

with a discrepancy given by

$$\delta = f - \sum_{k=1}^{n} (f, \phi_k)\phi_k. \tag{10.5}$$

The vector $\delta$ has components $\delta_1, \delta_2, \ldots, \delta_N$. For each $n \le N$, the Fourier expansion (10.4) possesses the familiar least-square property

$$\|\delta\| = \left\| f - \sum_{k=1}^{n} (f, \phi_k)\phi_k \right\| = \min \tag{10.6}$$

from among all the possible approximations of $f$ of the form

$$\sum_{k=1}^{n} a_k \phi_k.$$


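Let $f_1, \ldots, f_n$ be a set of $n$ vectors ($n \le N$) which are assumed to be linearly independent. The object of the Gram-Schmidt orthonormalization process is to produce a set of linear combinations of the vectors $f_i$:

$$\begin{aligned} \phi_1 &= a_{11} f_1, \\ \phi_2 &= a_{21} f_1 + a_{22} f_2, \\ \phi_3 &= a_{31} f_1 + a_{32} f_2 + a_{33} f_3, \\ &\ \ \vdots \end{aligned} \tag{10.7}$$

such that the set $\phi_1, \phi_2, \ldots$ are orthonormal. If this is accomplished, the Fourier expansion (10.4) may be expressed directly in terms of the vectors $f_i$:

$$f \sim \sum_{k=1}^{n} (f, \phi_k)\phi_k = \sum_{k=1}^{n} (f, \phi_k) \sum_{i=1}^{k} a_{ki} f_i = \sum_{i=1}^{n} \left[ \sum_{k=i}^{n} (f, \phi_k) a_{ki} \right] f_i = \sum_{i=1}^{n} d_i f_i, \tag{10.8}$$

where

$$d_i = \sum_{k=i}^{n} (f, \phi_k) a_{ki}, \qquad i = 1, 2, \ldots, n. \tag{10.9}$$

It is easy to see that the coefficients $d_i$ solve the problem of minimizing $\|f - \sum_{k=1}^{n} a_k f_k\|$ from among all the possible approximations of the form

$$\sum_{k=1}^{n} a_k f_k.$$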


10.3 The Gram-Schmidt Orthonormalization in Recursive 
Form 


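The Gram-Schmidt process may be put into the following recursive form, which is convenient for machine work. Let $f_1, \ldots, f_n$ be the $n$ given vectors to be orthogonalized. Set

$$D_1 = (f_1, f_1)^{1/2}, \qquad c_{11} = \frac{1}{D_1}, \qquad \phi_1 = c_{11} f_1; \tag{10.10a}$$

$$D_2 = [(f_2, f_2) - |(f_2, \phi_1)|^2]^{1/2}, \qquad c_{12} = -\frac{(f_2, \phi_1)}{D_2}, \qquad c_{22} = \frac{1}{D_2}, \qquad \phi_2 = c_{12}\phi_1 + c_{22} f_2; \tag{10.10b}$$

and, in general,

$$\begin{aligned} D_k &= [(f_k, f_k) - |(f_k, \phi_1)|^2 - |(f_k, \phi_2)|^2 - \cdots - |(f_k, \phi_{k-1})|^2]^{1/2}, \\ c_{1k} &= -\frac{(f_k, \phi_1)}{D_k}, \qquad c_{2k} = -\frac{(f_k, \phi_2)}{D_k}, \qquad \ldots, \qquad c_{k-1,k} = -\frac{(f_k, \phi_{k-1})}{D_k}, \qquad c_{kk} = \frac{1}{D_k}, \\ \phi_k &= c_{1k}\phi_1 + c_{2k}\phi_2 + \cdots + c_{k-1,k}\phi_{k-1} + c_{kk} f_k. \end{aligned} \tag{10.10c}$$

It may be verified immediately that

$$(\phi_i, \phi_j) = \delta_{ij}.$$

We shall also give a second form of (10.10).

$$\psi_1 = f_1, \qquad D_1 = (\psi_1, \psi_1)^{1/2}, \qquad \phi_1 = \psi_1/D_1; \tag{10.11a}$$

$$\psi_2 = f_2 - (f_2, \phi_1)\phi_1, \qquad D_2 = (\psi_2, \psi_2)^{1/2}, \qquad \phi_2 = \psi_2/D_2; \tag{10.11b}$$

and, in general,

$$\begin{aligned} \psi_k &= f_k - (f_k, \phi_1)\phi_1 - (f_k, \phi_2)\phi_2 - \cdots - (f_k, \phi_{k-1})\phi_{k-1}, \\ D_k &= (\psi_k, \psi_k)^{1/2}, \qquad \phi_k = \psi_k/D_k. \end{aligned} \tag{10.11c}$$

The vectors $\psi_k$ are orthogonal but not necessarily normal, whereas the $\phi_k$ are orthonormal.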
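
The second recursive form (10.11) translates almost line for line into code. The following is a minimal sketch for real vectors with the weighted inner product (10.1); the function names and the small data set are assumptions made for this illustration, and no safeguards against round-off are included.

```python
import math

def inner(u, v, w):
    """Weighted inner product (10.1) of two real vectors."""
    return sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))

def gram_schmidt(vectors, w):
    """Return (phis, Ds) computed by the recursion (10.11)."""
    phis, Ds = [], []
    for f in vectors:
        psi = list(f)
        for phi in phis:
            c = inner(f, phi, w)                    # Fourier coefficient (f_k, phi_j)
            psi = [p - c * q for p, q in zip(psi, phi)]
        D = math.sqrt(inner(psi, psi, w))           # D_k = (psi_k, psi_k)^(1/2)
        if D == 0.0:
            raise ValueError("the given vectors are linearly dependent")
        phis.append([p / D for p in psi])
        Ds.append(D)
    return phis, Ds

# Illustrative data: the powers 1, x, x^2 on five points with unequal weights.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ws = [0.5, 1.0, 1.0, 1.0, 0.5]
fs = [[x ** j for x in xs] for j in range(3)]
phis, Ds = gram_schmidt(fs, ws)

# D_3 can also be obtained from the bracket in (10.10c).
bracket = inner(fs[2], fs[2], ws) - sum(inner(fs[2], phis[j], ws) ** 2 for j in range(2))
```

The two expressions for $D_k$ agree in exact arithmetic; in floating point the bracket of (10.10c) is formed by subtracting nearly equal numbers, a point taken up again in Sec. 10.7.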
The auxiliary quantities $D_k$ are interesting in themselves in view of the identity

$$(D_1 D_2 \cdots D_n)^2 = G(f_1, f_2, \ldots, f_n) = \begin{vmatrix} (f_1, f_1) & \cdots & (f_1, f_n) \\ \vdots & & \vdots \\ (f_n, f_1) & \cdots & (f_n, f_n) \end{vmatrix}. \tag{10.12}$$

The determinant $G$ is known as the Gram determinant of the system of vectors, and $G^* = G(f_1, f_2, \ldots, f_n)/(\|f_1\|^2 \|f_2\|^2 \cdots \|f_n\|^2)$ is a "measure" of the linear independence of the vectors $f_1, \ldots, f_n$. We always have

$$0 \le G^* \le 1. \tag{10.13}$$

The lower value is attained if and only if the vectors $f_i$ are linearly dependent, and the upper value is attained if and only if the vectors are orthogonal. A second expression for $G$ is

$$G(f_1, f_2, \ldots, f_n) = (a_{11} a_{22} \cdots a_{nn})^{-2}. \tag{10.14}$$

In the special case where $n = N$, we have

$$G(f_1, f_2, \ldots, f_n) = [\det(f_1, f_2, \ldots, f_n)]^2 \tag{10.15}$$

and so

$$\det(f_1, f_2, \ldots, f_n) = D_1 D_2 \cdots D_n = (a_{11} a_{22} \cdots a_{nn})^{-1}. \tag{10.16}$$
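
The identity (10.12) and the bounds (10.13) can be checked on small examples. The sketch below is illustrative only: the function names are hypothetical, the determinant is computed by a plain elimination routine written for this purpose, and the data are arbitrary.

```python
import math

def inner(u, v, w):
    return sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            m = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= m * a[c][j]
    return det

def gram_determinant(vectors, w):
    return determinant([[inner(u, v, w) for v in vectors] for u in vectors])

def normalized_gram(vectors, w):
    """G* of (10.13): the Gram determinant divided by the product of squared norms."""
    scale = 1.0
    for v in vectors:
        scale *= inner(v, v, w)
    return gram_determinant(vectors, w) / scale

def gram_schmidt_D(vectors, w):
    """The quantities D_k of (10.11)."""
    phis, Ds = [], []
    for f in vectors:
        psi = list(f)
        for phi in phis:
            c = inner(f, phi, w)
            psi = [p - c * q for p, q in zip(psi, phi)]
        D = math.sqrt(inner(psi, psi, w))
        phis.append([p / D for p in psi])
        Ds.append(D)
    return Ds

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ws = [0.5, 1.0, 1.0, 1.0, 0.5]
fs = [[x ** j for x in xs] for j in range(3)]
G = gram_determinant(fs, ws)
product_squared = math.prod(gram_schmidt_D(fs, ws)) ** 2
G_star = normalized_gram(fs, ws)
```

For the powers $1, x, x^2$ the value of $G^*$ is already small, which reflects how nearly dependent successive powers are on a short interval.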
10.4 Coding with a Matrix Multiplication Subroutine 


It is possible to avoid excess coding by using a matrix multiplication subroutine. The scheme described in this section is due to E. Haynsworth. It is assumed that we have available a subroutine which will multiply an $m \times n$ matrix stored in location $A$ by an $n \times p$ matrix stored in location $B$ and store the result in location $C$. The locations and dimensions must be specified at each multiplication. The central and recurrent feature of the orthonormalization scheme is the construction of a vector of the form

$$g^* = g - (g, \phi_1)\phi_1 - (g, \phi_2)\phi_2 - \cdots - (g, \phi_k)\phi_k, \tag{10.17}$$

where $k = 1, 2, \ldots, n$ and $g$ may be any one of the vectors $f_1, \ldots, f_n, f$. This form also appears in the correction formula (10.26). We have, as a product of matrices of dimensions $1 \times N$, $N \times N$, and $N \times k$,

$$(g)(W)(\phi_1, \phi_2, \ldots, \phi_k) = [(g, \phi_1), (g, \phi_2), \ldots, (g, \phi_k)]. \tag{10.18}$$

Furthermore,

$$[(g, \phi_1), (g, \phi_2), \ldots, (g, \phi_k)]\,(\phi_1, \phi_2, \ldots, \phi_k)' = (g, \phi_1)\phi_1 + \cdots + (g, \phi_k)\phi_k.$$

Hence, if we designate the $N \times k$ matrix $(\phi_1, \phi_2, \ldots, \phi_k)$ by $\Phi_k$, we have

$$g^* = g(I - W\Phi_k\Phi_k').$$

We also need $(g^*, g^*)$, and this can be expressed as the product of matrices of dimensions $1 \times N$, $N \times N$, and $N \times 1$,

$$(g^*)(W)(g^*)' = (g^*, g^*). \tag{10.19}$$

The coding may now be based upon these identities.
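
The identities of this section can be sketched with a single general-purpose multiplication routine. The code below is an illustration under stated assumptions: matrices are stored as lists of rows, the function names are hypothetical, and the small example uses a diagonal $W$ together with two vectors that are orthonormal with respect to it.

```python
import math

def matmul(A, B):
    """Multiply an m x n matrix by an n x p matrix (both given as lists of rows)."""
    return [
        [sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def transpose(A):
    return [list(column) for column in zip(*A)]

def remove_components(g, Phi, W):
    """Return g* = g - (g W Phi) Phi' and the inner product (g*, g*) = g* W g*'."""
    row = [g]                                        # g as a 1 x N matrix
    coefficients = matmul(matmul(row, W), Phi)       # (10.18): 1 x k
    projection = matmul(coefficients, transpose(Phi))
    g_star = [[a - b for a, b in zip(row[0], projection[0])]]
    norm_squared = matmul(matmul(g_star, W), transpose(g_star))[0][0]   # (10.19)
    return g_star[0], norm_squared

# Example: W = diag(1, 2, 1); the columns of Phi are orthonormal with respect to W.
W = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
Phi = [[1.0, 0.0], [0.0, 1.0 / math.sqrt(2.0)], [0.0, 0.0]]
g_star, norm_squared = remove_components([3.0, 4.0, 5.0], Phi, W)
```

Here $g^*$ comes out as $(0, 0, 5)$ with $(g^*, g^*) = 25$: the components of $g$ along the two given vectors have been removed.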

The coding of a complex orthonormalization process follows the same pattern as that of the real case, with the obvious modification of working with complex numbers in the form of their real and imaginary parts. We must insist, however, that the weights in (10.1) be nonnegative, or that the matrix $W$ in (10.1'') be positive definite; otherwise $(\psi_i, \psi_i)$ need not be $\ge 0$, and the extraction of its square root becomes meaningless.


10.5 The Orthonormalization of Functions 


In theoretical discussions of orthonormal functions, the inner product is usually found to be of the form

$$(f, g) = \int w(x) f(x) g(x)\, dx \tag{10.20a}$$

in the real case and

$$(f, g) = \int w(x) f(x) \overline{g(x)}\, dx \tag{10.20b}$$

in the complex case. Here $w(x)$ is a fixed nonnegative weighting function, and $x$ is a real variable which ranges over a certain interval. Double and multiple integrals also occur. For numerical work, it is most convenient to replace the above integrals by an appropriate rule for numerical integration. That is, we introduce a fixed set of abscissas $x_1, \ldots, x_N$ and assume that

$$(f, g) = \int w(x) f(x) g(x)\, dx \approx \sum_{i=1}^{N} w_i f(x_i) g(x_i) \tag{10.21}$$

is sufficiently accurate for the problem in question. A function, therefore, is represented by a vector of dimension $N$ whose components are the values of the function at $x_1, \ldots, x_N$. In order to avoid linear dependency, the number $N$ of points $x_i$ must be selected at least as large as the number $n$ of functions dealt with. The constants $w_i$ must account for the weighting function $w(x)$ as well as for the integration process itself.
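
As an illustration of (10.21), the powers $1, x, x^2$ can be orthonormalized on $[-1, 1]$ with $w(x) = 1$, taking the $w_i$ from the trapezoidal rule. The choice of rule, the number of points, and the function names are assumptions made for this sketch. The resulting vectors approximate the normalized Legendre polynomials at the abscissas, so the last component of the third vector should be close to $(2 + \tfrac12)^{1/2}$.

```python
import math

def inner(u, v, w):
    return sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))

def gram_schmidt(vectors, w):
    """Orthonormalize by the recursion (10.11); returns the list of phi_k."""
    phis = []
    for f in vectors:
        psi = list(f)
        for phi in phis:
            c = inner(f, phi, w)
            psi = [p - c * q for p, q in zip(psi, phi)]
        D = math.sqrt(inner(psi, psi, w))
        phis.append([p / D for p in psi])
    return phis

# Trapezoidal rule on [-1, 1]: interior weights h, end weights h/2.
N = 201
h = 2.0 / (N - 1)
xs = [-1.0 + i * h for i in range(N)]
ws = [h] * N
ws[0] = ws[-1] = h / 2.0

powers = [[x ** j for x in xs] for j in range(3)]
phis = gram_schmidt(powers, ws)
value_at_one = phis[2][-1]          # approximates the normalized P_2 at x = 1
```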


10.6 Scaling 


Overflow in the computation of the inner products $(f_i, \phi_j)$ can be avoided by individually scaling down the $f_i$ sufficiently, since the scale factor of any $f_i$ does not affect the orthonormalization process. A second source of overflow occurs in the computation $\psi_k/D_k$, inasmuch as the orthonormal functions may take on large values at certain points of their range even though their value in the mean is one. This must be expected in general; for instance, in the case of the normalized Legendre polynomials $P_n(x)$ we have $P_n(1) = (n + \tfrac12)^{1/2}$. This situation occurs regardless of how the $f_i$ are scaled and can be corrected only by scaling up the weights $w_i$. When doing this, we must be careful to scale down the $f_i$ to avoid overflow in the computation of the inner products. This scaling should be carried out, not in the code itself, but when the specific data are prepared for insertion. The effect of scaling up the weights by a factor $k$ is that the orthonormal functions are scaled down by a factor $1/\sqrt{k}$. This device has the limitation that there is an upper limit to the quantity $k \max w_i$ in fixed-point machines. If the weights are initially chosen so that there is a wide spread in their values and some are close to the largest number the machine can handle, then $k$ cannot be very large; hence the scaling down of the orthonormal functions is negligible, and overflow occurs. In addition, any scaling down decreases the number of significant figures available for the computation. These scaling problems can be avoided by using floating-point routines, at the cost of much more time spent in computation and less storage capacity left for data.


10.7 Round-off 


Round-off occurs principally in the computation of $\psi_k/D_k$, because, when $k$ is large, the vectors $\psi_k$ and $D_k$ both become small. In adverse circumstances it may be so severe as to produce a meaningless computation. Round-off may show its effects in several ways: the theoretically orthonormal vectors $\phi_k$ become less and less orthonormal as $k$ becomes large; quantities in the brackets in (10.10c) may become negative, for large $k$, contrary to the Bessel inequality. A very good way of spotting round-off is the following. Suppose we are approximating $f$ by 4, 5, 6, ... functions $f_i$. For each $n$, there will be computed the norm $\|\delta_n\|$ of the discrepancy of the least-square approximation of $f$ by $n$ functions $f_i$. Theoretically, $\|\delta_n\|$ is monotonically decreasing, but it may be found that, after decreasing for a while, it actually increases! This indicates the presence of serious round-off, and the computations should be stopped at this point.

A method for alleviating some of the effects of round-off consists in a progressive "straightening out" of the orthonormal vectors. Let us suppose that we have a system of $n$ vectors $\phi_1, \phi_2, \ldots, \phi_{n-1}, \bar\phi_n$ of which the first $(n - 1)$ are substantially orthonormal,

$$(\phi_i, \phi_j) = \delta_{ij} \qquad (i, j = 1, 2, \ldots, n - 1), \tag{10.22}$$

whereas the $n$th vector $\bar\phi_n$ is normal but is slightly nonorthogonal to the first $(n - 1)$ vectors (these last two conditions actually occur in practice):

$$(\bar\phi_n, \phi_j) = \epsilon_j \qquad (j = 1, 2, \ldots, n - 1), \tag{10.23}$$

$$(\bar\phi_n, \bar\phi_n) = 1. \tag{10.24}$$

Write

$$\phi_n = \bar\phi_n + h,$$

where $\phi_n$ is the true (improved) $n$th orthonormal vector and $h$ is a correction vector whose norm is assumed to be small. Expanding $h$ in its Fourier series, we have

$$h = \phi_n - \bar\phi_n = \sum_{j=1}^{n-1} (h, \phi_j)\phi_j + (h, \phi_n)\phi_n. \tag{10.25}$$

From (10.23) we have

$$(\phi_n - h, \phi_j) = \epsilon_j$$

or

$$(\phi_n, \phi_j) - (h, \phi_j) = \epsilon_j$$

or

$$(h, \phi_j) = -\epsilon_j = -(\phi_j, \bar\phi_n), \qquad j = 1, 2, \ldots, n - 1.$$

From (10.24) we have

$$(\phi_n - h, \phi_n - h) = 1$$

or, neglecting $(h, h)$,

$$(h, \phi_n) = 0.$$

Thus,

$$\phi_n = \bar\phi_n - \sum_{j=1}^{n-1} (\bar\phi_n, \phi_j)\phi_j \tag{10.26}$$

gives us a formula for proceeding from a first approximation $\bar\phi_n$ to a better approximation $\phi_n$.
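
The correction (10.26) amounts to one further sweep of the form (10.17) applied to the nearly orthonormal vector. The sketch below uses hypothetical names and a contrived three-dimensional example with unit weights. It shows the first-order inner products $\epsilon_j$ being removed while the norm changes only by a quantity of second order.

```python
import math

def inner(u, v, w):
    return sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))

def straighten(phi_bar, phis, w):
    """Apply (10.26): subtract from phi_bar its components along the given phis."""
    improved = list(phi_bar)
    for phi in phis:
        c = inner(phi_bar, phi, w)            # epsilon_j of (10.23)
        improved = [a - c * b for a, b in zip(improved, phi)]
    return improved

w = [1.0, 1.0, 1.0]
phis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# A unit vector that is slightly nonorthogonal to the first two.
raw = [1e-4, -2e-4, 1.0]
scale = math.sqrt(inner(raw, raw, w))
phi_bar = [r / scale for r in raw]

before = [inner(phi_bar, phi, w) for phi in phis]
phi_new = straighten(phi_bar, phis, w)
after = [inner(phi_new, phi, w) for phi in phis]
norm_defect = abs(inner(phi_new, phi_new, w) - 1.0)
```

In this example the inner products drop from about $10^{-4}$ to zero, and the squared norm departs from one by roughly $5 \times 10^{-8}$, the neglected term $(h, h)$.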


10.8 Input and Output 


The Computation Laboratory of the National Bureau of Standards has developed five orthonormalizing codes. Codes I and II, written for the SEAC by P. Rabinowitz, are fixed-point codes, real and complex, respectively. Code III, written for the SEAC by J. Bram, is real and in floating point. Code IV, written by E. Haynsworth, incorporates the general inner product (10.1'') into Code III. Code V, written for the IBM 704 by E. Haynsworth, is in floating point. Codes I to IV are in single precision, yielding 11S for the fixed-point routines and 8S for the floating-point routines. Code V is in 704 single precision, yielding 8S.

The codes, as developed, read in $N$ weights, $N$ values for each of the $n$ functions $f_i$ ($n \le N$), and $N$ values of a function $f$. Then they compute and print out the $N$ values of each of the orthonormal functions $\phi_i$ and the residues $\delta$. In addition, they can print out the $n(n - 1)/2$ inner products $(f_j, \phi_i)$ ($i < j$), the $n$ values $D_i$, and the $n$ values $(f, \phi_i)$, which are the Fourier coefficients in (10.4). This material is sufficient in many cases. However, at other times, it is desirable to have the coefficients $d_i$ in the expansion (10.8) and the coefficients $a_{ij}$ in the expansions (10.7). These can be obtained without additional coding by using the same code and augmenting the input data as follows (see Fig. 10.1). Read in $(N + n)$ weights where the last $n$ weights are zero, $(N + n)$ values for each of the vectors $f_i$ where the last $n$ values of the augmented vector $f_i$ are equal to $\delta_{ij}$ ($j = 1, 2, \ldots, n$), and $(N + n)$ values for $f$ where the last $n$ values are zero. The orthonormalization procedure will then give the values $a_{ij}$ and $-d_i$ in addition to everything else mentioned above. Since we know that

$$f_i = D_i\phi_i + \sum_{j<i} (f_i, \phi_j)\phi_j,$$

we have a complete description of the relationship between the $f_i$ and the $\phi_i$ and hence of the relationship among $f$, $f_i$, and $\phi_i$. It is convenient to have key-word insertions which control the information printed out, since this may vary from problem to problem.
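
The augmentation device can be tried out directly. In the sketch below (hypothetical names, illustrative data) each input vector is extended by one row of the unit matrix, $f$ is extended by zeros, and the extra weights are zero, so that the inner products are unaffected while the same linear operations are carried out on the appended components. The appended part of each $\phi_k$ then holds the coefficients $a_{kj}$ of (10.7), and the appended part of the residue holds $-d_j$.

```python
import math

def inner(u, v, w):
    return sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))

def orthonormalize_with_residue(vectors, f, w):
    """Gram-Schmidt by (10.11), followed by the residue delta of (10.5)."""
    phis = []
    for v in vectors:
        psi = list(v)
        for phi in phis:
            c = inner(v, phi, w)
            psi = [p - c * q for p, q in zip(psi, phi)]
        D = math.sqrt(inner(psi, psi, w))
        phis.append([p / D for p in psi])
    delta = list(f)
    for phi in phis:
        c = inner(f, phi, w)
        delta = [d - c * q for d, q in zip(delta, phi)]
    return phis, delta

xs = [0.0, 0.25, 0.5, 0.75, 1.0]
N, n = len(xs), 3
weights = [1.0] * N + [0.0] * n                      # last n weights are zero
vectors = [
    [x ** j for x in xs] + [1.0 if i == j else 0.0 for i in range(n)]
    for j in range(n)
]
# f is chosen as 2 + 3x - x^2, so the expected coefficients d are (2, 3, -1).
target = [2.0 + 3.0 * x - x * x for x in xs] + [0.0] * n

phis, delta = orthonormalize_with_residue(vectors, target, weights)
a = [phi[N:] for phi in phis]      # a[k][j] is the coefficient of f_{j+1} in phi_{k+1}
d = [-value for value in delta[N:]]
```

The array `a` comes out lower triangular, in agreement with (10.7).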

Orthonormalizing codes must frequently be employed in conjunction with auxiliary data-preparation codes, which prepare the input data. As we see further on, the inputs may consist of successive powers of one fixed vector, of the values at a set of fixed points of a sequence of harmonic functions, or of solutions of certain linear differential equations. The most frequent input, at least from the point of view of production computation, is the sequence of powers $f_j$: $(x_1^j, x_2^j, \ldots, x_N^j)$, $j = 0, 1, \ldots, m$, and it was found convenient in developing Code III to have this input inserted automatically by means of a key word.


10.9 General Augmented Inputs 


We have pointed out that, by augmenting the vectors f,; with an 
n X n unit matrix, we obtain pertinent information without additional 
coding. What do we obtain when we augment (/;) with an arbitrary 
                         Vector Input

        Weights     f_1     f_2    ...    f_n       f

          w_1       y_11    y_21   ...    y_n1     y_1
    N      .         .       .             .        .
          w_N       y_1N    y_2N   ...    y_nN     y_N

           0         1       0     ...     0        0
    n      0         0       1     ...     0        0
           .         .       .             .        .
           0         0       0     ...     1        0

                         Vector Output

                   phi_1   phi_2   ...   phi_n    delta

                    z_11    z_21   ...    z_n1   delta_1
    N                .       .             .        .
                    z_1N    z_2N   ...    z_nN   delta_N

                    a_11    a_21   ...    a_n1    -d_1
    n                0      a_22   ...    a_n2    -d_2
                     .       .             .        .
                     0       0     ...    a_nn    -d_n

FIG. 10.1  Illustrating use of augmented vector inputs.


$p \times n$ matrix $T$ and the vector $f$ with an arbitrary vector $g$? Designate
the $N \times n$ matrix $(f_1, f_2, \ldots, f_n)$ by $F$, the $N \times n$ matrix $(\phi_1, \phi_2, \ldots, \phi_n)$
by $\phi$, the $n \times n$ matrix* $(a_{ij})$ by $A$, the augmented portion of the
orthonormal output matrix by $E$, and the augmented portion of the $\delta$
vector by $h$ (see Fig. 10.2). Then we have

$$TA = E. \tag{10.27}$$

If $T$ is itself given by

$$T = LF \tag{10.28}$$

* Here, and elsewhere, it is convenient to define $a_{ij} = 0$, $i < j$.
                         Vector Input

        Weights     f_1     f_2    ...    f_n       f

          w_1       y_11    y_21   ...    y_n1     y_1
    N      .         .       .             .        .
          w_N       y_1N    y_2N   ...    y_nN     y_N

           0        t_11    t_21   ...    t_n1     g_1
    p      0        t_12    t_22   ...    t_n2     g_2
           .         .       .             .        .
           0        t_1p    t_2p   ...    t_np     g_p

                         Vector Output

                   phi_1   phi_2   ...   phi_n    delta

                    z_11    z_21   ...    z_n1   delta_1
    N                .       .             .        .
                    z_1N    z_2N   ...    z_nN   delta_N

                    e_11    e_21   ...    e_n1     h_1
    p               e_12    e_22   ...    e_n2     h_2
                     .       .             .        .
                    e_1p    e_2p   ...    e_np     h_p

FIG. 10.2  Notation for general augmentation.



for some $p \times N$ matrix $L$, then, in view of (10.7), $FA = \phi$, and we have

$$E = L\phi. \tag{10.29}$$

The vector $h$ is given by

$$h = g - \sum_{k=1}^{n} (f,\phi_k)\,e_k, \tag{10.30}$$

where $e_k = (e_{k1}, e_{k2}, \ldots, e_{kp})$ are the rows of $E$.

In Sec. 10.12, we give an interpretation of (10.29) which is of use in
numerical work.

In the case where $p = N = n$, we may eliminate $A$ in (10.27) and
obtain

$$E = TF^{-1}\phi. \tag{10.31}$$

Since $\phi$ is orthonormal with respect to the inner product (10.1”), we
have $\phi'W\phi = I$, and hence,

$$F^{-1} = T^{-1}E\phi'W. \tag{10.32}$$

If we make the particular selection $T = I$, $W = I$ (the first augmented
situation described), then $E = A$ and

$$F^{-1} = A\phi'. \tag{10.33}$$
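
The unit-matrix augmentation ($T = I$, $g = 0$) can be checked numerically. The following is a minimal sketch, not the code described in the text: it assumes a modified Gram-Schmidt loop in NumPy and the finite inner product $(u,v) = \sum_k w_k u_k v_k$, and the function name `augmented_orthonormalize` is hypothetical. The last $n$ rows of the output carry the coefficients $a_{ij}$ and the values $-d_i$.

```python
import numpy as np

def augmented_orthonormalize(F, f, w):
    """Orthonormalize the columns of F after appending an n x n unit matrix.

    Hypothetical helper. Returns (Phi, A, delta, d): Phi is the N x n
    orthonormal part, column i of A holds the coefficients of phi_i in
    terms of the f_j, delta is the discrepancy f - f*, and d holds the
    least-square coefficients of f* = sum_j d_j f_j.
    """
    F = np.asarray(F, dtype=float)
    N, n = F.shape
    Fa = np.vstack([F, np.eye(n)])                 # augmented vectors f_i
    fa = np.concatenate([np.asarray(f, dtype=float), np.zeros(n)])
    wa = np.concatenate([np.asarray(w, dtype=float), np.zeros(n)])  # zero weights

    Phi = np.empty_like(Fa)
    for i in range(n):
        v = Fa[:, i].copy()
        for j in range(i):                          # remove earlier components
            v -= np.sum(wa * v * Phi[:, j]) * Phi[:, j]
        Phi[:, i] = v / np.sqrt(np.sum(wa * v * v))

    delta = fa.copy()                               # f is projected, not normalized
    for j in range(n):
        delta -= np.sum(wa * delta * Phi[:, j]) * Phi[:, j]

    return Phi[:N], Phi[N:], delta[:N], -delta[N:]
```

Because the appended rows carry zero weight, they do not influence any inner product; they are simply carried along through the same linear combinations, which is why they record the coefficients.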


10.10 Uses for Orthonormalizing Codes 


We shall now indicate some areas where orthonormalizing codes have 
been found useful. The following list should be regarded as suggestive 
rather than exhaustive, and the reader will surely be able to augment it. 
A number of the following topics are discussed in detail in later sections. 

1. Expansion of a given function in a series of orthogonal functions,
for example, in a series of Legendre polynomials, Chebyshev
polynomials, or trigonometric functions. This is equivalent to a harmonic
analysis or a “Chebyshev analysis,” depending on the set selected.

2. Approximation in a least-square sense of a given function by linear
combinations of powers, rational functions, trigonometric exponentials,
other special sets of functions, or of sets of functions which are defined
numerically by a set of values.

3. Curve fitting of empirical data in two dimensions and in a higher 
number of dimensions. Smoothing. Extrapolation. The dual prob- 
lem of finding the best (least-square) solution to a system of N linear 
equations in $n$ unknowns, $n < N$. In any computation laboratory,
these are likely to be “bread-and-butter” problems for an
orthonormalizing code.

4. Matrix inversion and solution of linear systems of equations. 

5. Approximation theory as applied to boundary-value problems of
potential theory or of more general linear partial differential equations
of elliptic type.

6. Least-square methods as applied to boundary-value problems of
ordinary differential equations.

7. Least-square methods as applied to integral equations and other 
linear functional equations. 

8. Use of complex orthogonal functions for conformal mapping and 
for certain auxiliary conformal quantities. 

It should be pointed out that, although somewhat more efficient 
codes could be devised for the problems of the above list, the adapt- 
ability of a single code to a variety of purposes is a very attractive feature. 
The inputs for these problems have been found especially easy to handle. 


10.11 Least-square Approximations and Orthonormal
Expansions


Least-square approximations and orthonormal expansions are essentially
identical. We deal with a finite interval which, for simplicity,
has been taken as $-1 \le x \le 1$. If the functions $f_k$ are selected as
$f_k = x^k$ $(k = 0, 1, \ldots)$ and $w(x) = 1$, then the orthonormalizing process
generates the Legendre polynomials. If $f_k = x^k$ and $w(x) = (1 - x^2)^{-1/2}$
or $w(x) = (1 - x^2)^{1/2}$, the process generates the Chebyshev polynomials
of the first and second kinds, respectively. More generally, if
$w(x) = (1 - x)^\alpha (1 + x)^\beta$, $\alpha > -1$, $\beta > -1$, the process will generate
the so-called Jacobi polynomials $P_n^{(\alpha,\beta)}(x)$. Because of the approximate
nature of the integration rule (10.21), there will be some deviation
between the polynomials obtained numerically and those obtained by
an exact integration, but in any case the orthonormal expansions
resulting are exact with respect to the inner product (10.21) which has
been set up, and the least-square property $\|f - f^*\|^2 = \min$ is valid.
If we take for the functions $f_k$ the set $1, \sin x, \cos x, \ldots, \sin mx, \cos mx$,
computed at 2m equally spaced points on an interval of length $2\pi$, and
if $w_k = 1$, the orthonormalization process will leave these functions
unaltered, and a Fourier expansion (really a trigonometric interpolation
series) results.

If the function $f$ which is to be approximated is given in closed form
or can be otherwise computed on a set of nonequally spaced abscissas
$x_1, \ldots, x_N$, then there is the possibility of using integration rules of great
accuracy (such as the Gaussian).

There is also the possibility of computing the orthonormal functions
exactly in certain cases and obtaining an interpolation series in these
orthonormal functions. Thus, for example, for Chebyshev polynomials
of the first kind $[T_n = P_n^{(-1/2,-1/2)}]$, we have the orthogonality
relationship

$$\sum_{i=0}^{N} T_j(x_i)\,T_k(x_i) = 0, \qquad j \ne k;\ j, k < N + 1, \tag{10.34}$$

where

$$x_i = \cos\frac{(2i + 1)\pi}{2N + 2} \qquad (i = 0, 1, \ldots, N). \tag{10.35}$$

Hence, if we select for $f_k$ the functions $x^k$ tabulated at the $(N + 1)$
abscissas (10.35) and take $w_i = 1$ in (10.21), the orthonormalization
procedure will yield the $T_n$ exactly.

The general phenomenon of which (10.34) is but a special instance
may be described as follows. Let $p_0(x), p_1(x), \ldots$ be the set of orthonormal
polynomials which result from the inner product (10.20a).
Let an integer $N$ be fixed, and let the zeros of the polynomial $p_{N+1}(x)$ be
designated by $x_1, \ldots, x_{N+1}$. Then, by a classical result (see, e.g.,
Szegő [30], p. 46), we have

$$\int_{-1}^{1} p(x)\,w(x)\,dx = \sum_{k=1}^{N+1} \lambda_k\,p(x_k) \tag{10.36}$$

whenever $p(x)$ is a polynomial of degree $2N + 1$ at most. Here $\lambda_k$
are the Christoffel numbers corresponding to $x_k$ and defined by

$$\lambda_k = \int_{-1}^{1} \frac{p_{N+1}(x)}{(x - x_k)\,p'_{N+1}(x_k)}\,w(x)\,dx. \tag{10.37}$$

Thus, we must have

$$\int_{-1}^{1} p_j(x)\,p_k(x)\,w(x)\,dx = \delta_{jk} = \sum_{i=1}^{N+1} \lambda_i\,p_j(x_i)\,p_k(x_i), \qquad j, k \le N. \tag{10.38}$$

We may conclude from (10.38) that, if in the numerical orthonormalizing
process we take for $f_k$ the functions $x^k$ $(k = 0, 1, \ldots, N)$
computed at the Jacobi abscissas $x_1, \ldots, x_{N+1}$ of order $N + 1$ and use
the Christoffel numbers $\lambda_k$ as weights in the inner product (10.21), then
the exact orthonormal polynomials will result.

The appropriate zeros and weights are available in a number of
instances. In the Legendre case ($\alpha = \beta = 0$) they have been listed
(Davis and Rabinowitz [12]) to order 48. For Chebyshev polynomials
of the first kind ($\alpha = \beta = -\tfrac12$), the zeros are given by (10.35) and
the weights $\lambda_k$ by $\lambda_k = \text{const}$ $(k = 1, \ldots, N + 1)$. Thus (10.38)
yields (10.34). For the Chebyshev polynomials of the second kind,
$U_n(x)$ (see, e.g., Szegő [30], pp. 343-344; see also pp. 59, 369), the zeros
of $U_{N+1}$ are given by $x_k = \cos\,[k\pi/(N + 2)]$ $(k = 1, 2, \ldots, N + 1)$,
whereas the Christoffel numbers are $\lambda_k = \text{const} \cdot \sin^2\,[k\pi/(N + 2)]$
$(k = 1, 2, \ldots, N + 1)$. Asymptotic values of $x_k$ and $\lambda_k$ are available
in more general situations.
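
The conclusion drawn from (10.38) can be illustrated in the Legendre case. The sketch below is an illustration under stated assumptions, not the code of the text: it takes the Gauss-Legendre abscissas and Christoffel numbers from NumPy's `leggauss`, orthonormalizes the powers $x^k$ by a plain Gram-Schmidt loop (the helper name `orthonormalize_powers` is hypothetical), and compares the result with the orthonormal Legendre polynomials $\sqrt{(2k+1)/2}\,P_k(x)$.

```python
import numpy as np
from numpy.polynomial import legendre

def orthonormalize_powers(x, lam, n_max):
    """Gram-Schmidt on 1, x, ..., x^n_max with (u, v) = sum_k lam_k u_k v_k."""
    Phi = np.empty((len(x), n_max + 1))
    for k in range(n_max + 1):
        v = x ** k
        for j in range(k):
            v = v - np.sum(lam * v * Phi[:, j]) * Phi[:, j]
        Phi[:, k] = v / np.sqrt(np.sum(lam * v * v))
    return Phi

def orthonormal_legendre(x, n_max):
    """Values of sqrt((2k+1)/2) * P_k(x) for k = 0, ..., n_max."""
    cols = []
    for k in range(n_max + 1):
        coeffs = np.zeros(n_max + 1)
        coeffs[k] = 1.0                      # selects P_k in the Legendre series
        cols.append(np.sqrt((2 * k + 1) / 2.0) * legendre.legval(x, coeffs))
    return np.column_stack(cols)

N = 5
x, lam = legendre.leggauss(N + 1)            # zeros of P_{N+1} and Christoffel numbers
Phi = orthonormalize_powers(x, lam, N)
max_difference = np.max(np.abs(Phi - orthonormal_legendre(x, N)))
```

With an equally weighted, equally spaced grid in place of `x` and `lam`, the same loop would produce polynomials orthonormal only with respect to that discrete sum, which is the deviation mentioned at the start of this section.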


10.12 Curve Fitting 


We assume that $N$ pieces of data take the values $y_1, \ldots, y_N$ at the
points $x_1, x_2, \ldots, x_N$, yielding the points $(x_i, y_i)$. These points need
not be distinct and may be listed in any convenient order, but we
assume here that there are at least $(n + 1)$ distinct abscissas among them.
Repetitions may therefore occur among the points $(x_i, y_i)$. It is
desired to pass that polynomial of degree $n < N$ through these points
which fits them best in the least-square sense. To this end, we need
only select $f: (y_1, \ldots, y_N)$ and $f_k: (x_1^k, x_2^k, \ldots, x_N^k)$ $(k = 0, 1, \ldots, n)$.
The selection $w_i = 1$ places equal emphasis on each piece of data, but
an unequal selection of weights may be called for from time to time.
A similar scheme may be employed in the case of two or more independent
variables. The quantities $d_i$ [see (10.9) and Fig. 10.1] will be the
coefficients of the minimal polynomial and the $\delta_i$ the individual
discrepancies. If all that is required is a plot of the minimal curve, this
can be obtained directly from the discrepancies. In such a case the
unaugmented scheme can be used. However, if this polynomial is
required explicitly, the augmented matrix scheme is convenient.
Assuming that the least-square fit of experimental data $f: (y_1, \ldots, y_N)$
by means of linear combinations of functions $f_i: (y_{i1}, y_{i2}, \ldots, y_{iN})$
$(i = 1, 2, \ldots, n)$ is given by $f^* = \sum_{i=1}^{n} d_i f_i$, the variance of $d_i$, var $d_i$, and
the covariance of $d_i$ and $d_j$, cov $(d_i,d_j)$ [var $d_i$ = cov $(d_i,d_i)$] are frequently
desired in statistical analysis. These statistical quantities are easily
expressed in terms of the quantities $a_{ij}$. Utilizing (10.9) and (10.21)
and assuming that

$$\operatorname{cov}\,(y_i,y_j) = \delta_{ij}/Nw_i, \tag{10.39}$$

it may be shown that

$$\operatorname{cov}\,(d_i,d_j) = \sum_{p=1}^{n} a_{pi}\,a_{pj}. \tag{10.40}$$

In (10.40) we have written $a_{pi} = 0$ for $p < i$.
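
The sum on the right of (10.40) has a simple algebraic meaning: if $\phi_p = \sum_i a_{pi} f_i$ and the $\phi_p$ are orthonormal in the weighted sum, then $\sum_p a_{pi}a_{pj}$ is the $(i,j)$ entry of the inverse of the matrix of inner products $(f_i, f_j)$. The sketch below checks only this identity. It is an assumption-laden illustration: the coefficients are obtained from a QR factorization (a swapped-in technique standing in for the orthonormalizing code), the inner product is taken as $\sum_k w_k u_k v_k$ without any further normalizing factor, and the function names are hypothetical.

```python
import numpy as np

def orthonormalizing_coefficients(F, w):
    """Return the n x n array C with Phi = F @ C and Phi' W Phi = I.

    Column p of C holds the coefficients a_{p1}, ..., a_{pn} of phi_p.
    """
    sw = np.sqrt(np.asarray(w, dtype=float))
    Q, R = np.linalg.qr(sw[:, None] * np.asarray(F, dtype=float))
    R = np.sign(np.diag(R))[:, None] * R     # make the diagonal positive
    return np.linalg.inv(R)                  # upper triangular

def coefficient_sums(C):
    """Entry (i, j) is sum_p a_{pi} a_{pj}, the right-hand side of (10.40)."""
    return C @ C.T
```

For example, with `F` holding the powers $1, x, x^2$ at the data abscissas, `coefficient_sums(orthonormalizing_coefficients(F, w))` agrees with `np.linalg.inv(F.T @ (w[:, None] * F))` up to round-off.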

The dual of the above in the multidimensional linear case consists
in a least-square fitting of a point to a system of hyperplanes. This may
be described algebraically as follows (see, e.g., Whittaker and Robinson
[32], p. 209). Let there be given a set of $N$ linear equations in $n$
unknowns, $n < N$:

$$\alpha_{i1}x_1 + \alpha_{i2}x_2 + \cdots + \alpha_{in}x_n = \beta_i \qquad (i = 1, \ldots, N). \tag{10.41}$$

It is desired to determine that vector $(x_1^*, \ldots, x_n^*)$ such that

$$\sum_{i=1}^{N} (\beta_i - \alpha_{i1}x_1^* - \alpha_{i2}x_2^* - \cdots - \alpha_{in}x_n^*)^2 = \min. \tag{10.42}$$

We assume that the system (10.41) is such that this problem has a unique
solution. To employ the orthonormalization code, we take $f: (\beta_1, \ldots, \beta_N)$
and $f_j: (\alpha_{1j}, \alpha_{2j}, \ldots, \alpha_{Nj})$. Weights $w_i = 1$ should be taken.
The quantities $d_j$ will be the required solution, and the $\delta_i$ will be the
individual discrepancies.
In many working examples, one is asked to approximate data by, say,
a polynomial. But what one may really want is, not the least-square
polynomial itself, but some further quantity obtained from the polynomial
by a linear process, perhaps an integral or a derivative at
specified points. If $L$ designates a linear operator and $f$ a function (or a
vector), this reflects the working rule

Approximation to $L(f)$ = $L$(approximation to $f$). (10.43)

Now $L$(approximation to $f$) may be conveniently computed by utilizing
the scheme of general augmented input as follows. Augment each
vector $f_i$ $(i = 1, \ldots, n)$ by the vector $L(f_i)$ and augment $f$ by $(0, \ldots, 0)$.
Then, according to (10.29) and (10.30), the augmented $\delta$ vector $h$ will
yield

$$-h = \sum_{k=1}^{n} (f,\phi_k)\,e_k = \sum_{k=1}^{n} (f,\phi_k)\,L(\phi_k) = L\sum_{k=1}^{n} (f,\phi_k)\,\phi_k = L(\text{approximation to } f). \tag{10.44}$$
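
As an illustration of (10.44), take $L$ to be integration over $-1 \le x \le 1$, so that each power $x^i$ is augmented by the single number $\int_{-1}^{1} x^i\,dx$. The following is a minimal sketch under stated assumptions: a modified Gram-Schmidt loop with the finite inner product $\sum_k w_k u_k v_k$, and hypothetical names (`augmented_discrepancy`, `integral_of_fit`); it is not the code described in the text.

```python
import numpy as np

def augmented_discrepancy(F, f, w, T, g):
    """Orthonormalize the columns of F augmented by the rows of T.

    Returns (E, h): the augmented part of the orthonormal output and the
    augmented part of the discrepancy vector, as in Fig. 10.2.
    """
    F = np.asarray(F, dtype=float)
    T = np.atleast_2d(np.asarray(T, dtype=float))
    N, n = F.shape
    p = T.shape[0]
    Fa = np.vstack([F, T])
    fa = np.concatenate([np.asarray(f, dtype=float), np.asarray(g, dtype=float)])
    wa = np.concatenate([np.asarray(w, dtype=float), np.zeros(p)])  # zero weights

    Phi = np.empty_like(Fa)
    for i in range(n):
        v = Fa[:, i].copy()
        for j in range(i):
            v -= np.sum(wa * v * Phi[:, j]) * Phi[:, j]
        Phi[:, i] = v / np.sqrt(np.sum(wa * v * v))

    delta = fa.copy()
    for j in range(n):
        delta -= np.sum(wa * delta * Phi[:, j]) * Phi[:, j]
    return Phi[N:], delta[N:]

def integral_of_fit(x, y, degree):
    """Integral over [-1, 1] of the least-square polynomial through (x, y)."""
    F = np.vander(x, degree + 1, increasing=True)
    # L(x^i) = integral of x^i over [-1, 1]
    T = [[(1.0 - (-1.0) ** (i + 1)) / (i + 1) for i in range(degree + 1)]]
    _, h = augmented_discrepancy(F, y, np.ones(len(x)), T, [0.0])
    return -h[0]
```

For data sampled from $e^x$ at 21 equally spaced points, `integral_of_fit(x, np.exp(x), 4)` should lie close to $e - e^{-1}$, and it equals the integral of the explicitly computed least-square polynomial up to round-off.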


10.13 Solution of Linear Systems of Equations and Matrix 
Inversion 


A special case of (10.41) arises when $n = N$, that is, a system of $N$
linear equations in $N$ unknowns. If the matrix $(\alpha_{ij})$ is nonsingular,
there is a unique solution which must coincide with the minimal
solution $x_1^*, \ldots, x_N^*$ in (10.42), the discrepancies in this case being
theoretically all zero. The technique of solution is therefore the same
as in the last paragraph, but the following observations should be made.
The $\delta$ column prints out the discrepancies from the theoretical zero which
arise from round-off. If it is necessary to save storage space, we do not
have to augment the input vectors in the full way explained previously.
Instead, we augment $f_i$ by the single value $\beta_i$ and use a corresponding
weight equal to zero. The vector $f$ we take as $(\beta_1, \ldots, \beta_N, 0)$. If
we denote by $\gamma_k$ the number which then appears in the output vectors
in the places corresponding to the augmented $\beta_i$, then the solution
$x_1^*, \ldots, x_N^*$ is given by


N 
MS = > 2a een ee eee (10.45; 
k=1 


10.14 Partial Differential Equations of Elliptic Type 


We have already discussed the principal elementary uses for ortho- 
normalizing codes. However, there are many problems of a more 
advanced character in which orthonormalization or least-square 
approximation can occur as an important intermediate step. In the 
subsequent sections we review a number of such problems. 

The technique may be applied to the solution of boundary-value
problems of linear partial differential equations of elliptic type. As a
simple case, let us suppose that we are dealing with the differential
equation

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} - A(x,y)\,u = 0, \qquad A(x,y) > 0, \tag{10.46}$$

and are required to find a function $u$ which satisfies (10.46) and which,
along the boundary $b$ of a simply connected region $B$, takes on prescribed
values $f$: $u(s) = f(s)$ (the Dirichlet problem). Here $s$ is a
length parameter along $b$. Suppose, first, that we have succeeded in
constructing a number of particular solutions $u_k$ of the differential
equation (10.46). By a particular solution, we mean any function
which is a solution of (10.46) but which need satisfy no boundary conditions.
In the case of the potential equation ($A = 0$), for example,
we may take for $u_k$ the harmonic polynomials. We denote by $u_k(s)$ the
values which $u_k(x,y)$ takes on $b$. We then solve the problem

$$\int_b \Bigl[f(s) - \sum_{k=1}^{n} a_k u_k(s)\Bigr]^2 ds = \min. \tag{10.47}$$

To arrive at this solution using our procedure, we introduce $N$ points
$(x_i, y_i)$ on the contour $b$. For the vector $f$ we take $[f(x_1,y_1), \ldots,$
$f(x_N,y_N)]$, and for the vectors $f_i$ we take $[u_i(x_1,y_1), \ldots, u_i(x_N,y_N)]$.
Weights $w_i$ appropriate to the distribution of the points $(x_i, y_i)$ should be
selected. The constants $d_i$ of the output will then be the required $a_i$, and
the approximate solution $u^*(x,y) = \sum_{k=1}^{n} a_k u_k(x,y)$ may be computed.

Appropriate modifications can be introduced whenever the boundary
conditions involve normal derivatives.
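
For the potential equation, the procedure of this section can be sketched as follows. The example is hypothetical and makes several assumptions not in the text: the region is an ellipse, the weights are the chord lengths between successive boundary points, and the weighted least-square problem (10.47) is solved with NumPy's `lstsq` rather than with an orthonormalizing code (both give the same minimizing coefficients when the solution is unique).

```python
import numpy as np

def harmonic_basis(x, y, degree):
    """Columns 1, Re z, Im z, ..., Re z^degree, Im z^degree at the points (x, y)."""
    z = x + 1j * y
    cols = [np.ones_like(x)]
    for m in range(1, degree + 1):
        zm = z ** m
        cols += [zm.real, zm.imag]
    return np.column_stack(cols)

def solve_dirichlet(xb, yb, fb, w, degree):
    """Coefficients a_k minimizing sum_i w_i [f_i - sum_k a_k u_k(x_i, y_i)]^2."""
    U = harmonic_basis(xb, yb, degree)
    sw = np.sqrt(w)
    a, *_ = np.linalg.lstsq(sw[:, None] * U, sw * fb, rcond=None)
    return a

# Hypothetical test region: 60 points on the ellipse x = 0.8 cos t, y = 0.5 sin t.
t = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
xb, yb = 0.8 * np.cos(t), 0.5 * np.sin(t)
w = np.hypot(np.roll(xb, -1) - xb, np.roll(yb, -1) - yb)   # chord-length weights

def exact(x, y):
    return np.exp(x) * np.cos(y)                            # Re(e^z), harmonic

a = solve_dirichlet(xb, yb, exact(xb, yb), w, degree=5)
boundary_discrepancy = exact(xb, yb) - harmonic_basis(xb, yb, 5) @ a

xi = np.array([0.1, -0.3, 0.0])                             # interior test points
yi = np.array([0.1, 0.2, -0.4])
interior_error = exact(xi, yi) - harmonic_basis(xi, yi, 5) @ a
```

Comparing `interior_error` with `boundary_discrepancy` gives the kind of check on the interior behavior that is carried out for the bean-shaped region in Sec. 10.19.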


10.15 Some Theoretical Remarks on the Solution of 
Elliptic Partial Differential Equations 


The numerical procedure just described must be reviewed against the 
following theoretical background. The first problem is that of obtain- 
ing a family of particular solutions of the differential equation. If the 
differential equation is sufficiently simple, then particular solutions are 
immediately available. For instance, for the harmonic equation
$\Delta u = 0$, we may take the harmonic polynomials

$$\operatorname{Re}(z^m)\ (m = 0, 1, \ldots) \quad\text{and}\quad \operatorname{Im}(z^m)\ (m = 1, 2, \ldots), \tag{10.48}$$

where $z = x + iy$. For the harmonic equation in three dimensions,
we may take the spherical harmonics

$$r^n P_n^m(\cos\theta)\,e^{im\phi}. \tag{10.49}$$

For the biharmonic equation $\Delta\Delta u = 0$ in two dimensions, we may take

$$\operatorname{Re}(\bar z z^m + z^n) \qquad (z = x + iy;\ m, n = 0, 1, \ldots). \tag{10.50}$$


We cannot record here all the differential equations for which families 
of particular solutions are available in elementary form or in the form of 
special functions. For further information on this subject, the inter- 
ested reader is referred to Bergman [5, 7], Vekua [31], and Henrici 
[21, 21a]. Nor can we enter into the details of obtaining particular
solutions of general differential equations. This is a complicated
problem to which much study has been devoted. 

The second theoretical problem which arises is that of the “complete- 
ness” of the family of particular solutions which has been selected. By 
this is meant the possibility of arbitrarily close approximation of the 
boundary data by means of the boundary values of linear combinations 
of the particular solutions. When the selected set of particular solutions
is complete, then, as $n \to \infty$, the approximate solutions
$\sum_{k=1}^{n} a_k u_k(x,y)$ will converge uniformly to the theoretical solution,
and hence the numerical computation has, at least, a theoretical chance
of being nearly correct. If the set is incomplete, then it may not have
this chance.
The completeness of a given set may be related to the geometrical nature 
of the boundary, the connectivity of the region, and the type of boun- 
dary data which is allowed. On theoretical problems relating to 
completeness, see Bergman [5] and Fichera [20]. Here we record only 
that, for simply connected regions with smooth boundaries and boun- 
dary data which belong to L?, the harmonic polynomials (10.48) form a 
complete set of solutions for the harmonic equation. 

In numerical practice one works with only a finite number of particular
solutions, that is, with a system which is not complete. This
directs attention to the third theoretical problem: that of developing
error estimates for the discrepancy between the theoretical solution
$u$ and the approximate solution $u^*$. This has recently been done by
Nehari in the case of the two-dimensional harmonic equation with a
variety of normalizations [26]. Nehari also points out that his methods
are applicable to the more general equation (10.46).

Nehari’s error estimates are such that, having dealt with n particular 
solutions and having gone through the necessary orthonormalizations, 
one can then estimate the error incurred. It is not possible to use the 
estimate to tell at the beginning of the computation how many particular 
solutions must be employed in order to achieve a prescribed accuracy. 
We quote here one of Nehari’s theorems which is related to the type of 
experimental computations we have carried out: 

Let $u(x,y)$ be harmonic in a convex domain $A$ and let $U(s)$ denote the values of
$u$ on the boundary $C$ of $A$. Let $u_1(x,y), \ldots, u_n(x,y)$ be functions which are
harmonic in $A$ and which are orthonormalized by the conditions

$$\int_C u_\mu(s)\,u_\nu(s)\,ds = \delta_{\mu\nu}. \tag{10.51}$$

If $a_1, \ldots, a_n$ are the Fourier coefficients

$$a_\nu = \int_C U(s)\,u_\nu(s)\,ds, \tag{10.52}$$

then


(u(x) — 3 aut(e7)]* 


<|[m0 « -~ Sella Laas 
— 3 %(s,9) | (x’, 9’) on C. 


The right-hand side of this estimate tends to zero if $n \to \infty$ and $\{u_\nu(x,y)\}$ is an
infinite set which is complete in the space of harmonic functions $u(x,y)$ for which

$$\int_C u^2(s)\,ds < \infty.$$


The order in which the particular solutions are orthonormalized may 
also be of importance in numerical work. It is obviously better to 
insert first those functions which will approximate the boundary data 
in the best way. There may be cases in which one can tell beforehand 
which functions are better, but in general one merely adopts some 
arbitrary order. 


10.16 Conformal Mapping and Related Quantities 


It has been known for some time that the interior mapping function 
of a simply connected domain B, as well as its exterior mapping function, 
can be obtained from the complex polynomials which have been orthog- 
onalized over the boundary of B. 

Let $B$ designate a simply connected region lying in the complex $z$
plane whose boundary $C$ is rectifiable. Let $w(z)$ designate a positive
and continuous weight function defined on $C$ (or on $B + C$). In the
space of analytic functions which are regular in $B + C$, we may introduce
the inner product

$$(f,g) = \int_C f(z)\,\overline{g(z)}\,w(z)\,ds \tag{10.53}$$

and orthonormalize the powers $1, z, z^2, z^3, \ldots$ with respect to this inner
product. We designate the polynomials which arise in this fashion by

$$p_n(z) = k_n z^n + \cdots, \qquad k_n > 0. \tag{10.54}$$

We designate by $\phi(z)$ the function which maps the exterior of $C$ conformally
onto the exterior of $|w| = 1$, $\phi(\infty) = \infty$, $\phi'(\infty) > 0$. If $z$ is
exterior to $C$, we have

$$\lim_{n\to\infty} \frac{p_{n+1}(z)}{p_n(z)} = \phi(z), \tag{10.55}$$

$$\lim_{n\to\infty} \frac{k_n}{k_{n+1}} = c, \tag{10.56}$$

where $c$ is the transfinite diameter of $C$. These results are independent
of the weight function $w(z)$.

Let $\psi(z)$ map the interior of $C$ conformally onto the interior of
$|w| = 1$. Let the weight $w(z) = 1$; then


= a YS, L ‘ ae 4 on 
d Pal2Z)Palt) = 5 [v'(z) yO) (10.57) 
n=0 we 
where $L$ is the length of $C$. For details on these matters and for some
information on the rapidity of convergence, see Szegő ([30], pp. 355-366).

Orthogonal polynomials will also arise from the inner product
$(f,g) = \iint_B f(z)\,\overline{g(z)}\,dx\,dy$. For identities parallel to those above, see
Bergman [5].

The quantity $c$ in (10.56), known as the transfinite diameter of $B$, was
originally introduced into analysis in a different form by Fekete [18].
According to Fekete's original work, the transfinite diameter of an
arbitrary closed bounded point set $E$, $\tau(E)$, is defined as

$$\tau(E) = \lim_{n\to\infty} V_n^{1/\binom{n}{2}} \tag{10.58}$$

where

$$V_n = \max_{z_i \in E} |V(z_1, z_2, \ldots, z_n)| \tag{10.59}$$

and where $V$ is the Vandermonde determinant for the indicated arguments.
The identification between $\tau(E)$ and certain conformal and
potential theoretic quantities has been carried out by both Fekete and
Szegő [29]. A generalization of the relationship (10.56) has recently
been obtained by Fekete and Walsh ([19], p. 61):

Let $E$ consist of a finite number of rectifiable Jordan arcs. Let

$$p_n(z) = a_n z^n + \cdots, \qquad a_n > 0 \quad (n = 0, 1, 2, \ldots)$$

be complex polynomials that are orthonormal over $E$ in the sense that

$$\int_E p_m(z)\,\overline{p_n(z)}\,ds = \delta_{mn}.$$

Then we have

$$\tau(E) = \lim_{n\to\infty} (1/a_n)^{1/n}. \tag{10.60}$$



10.17 Some Linear Functional Equations 


In the above scheme for a partial differential equation, we have 
satisfied the equation and approximated the boundary conditions. 
There are situations in which we want to do the reverse. One such 
occurs in the solution of linear functional equations by the least-square
method (see, e.g., Collatz [9], pp. 130, 321, 384). Let $L(y) = f(x)$ be a
nonhomogeneous linear functional equation with solution $y(x)$ subject
to certain homogeneous or nonhomogeneous auxiliary conditions. To
solve this problem, we let

$$y^*(x) = y_0(x) + \sum_{k=1}^{n} a_k y_k(x) \tag{10.61}$$

where $y_0(x)$ satisfies the nonhomogeneous auxiliary conditions and $y_k$,
the homogeneous ones. We then choose the $a_k$ so that

$$\|L(y^*) - f(x)\| = \min,$$

where some appropriate norm has been introduced. The way these $a_k$
are computed using the orthonormalizing code is as follows: for
$f_i$ $(i = 1, \ldots, n)$ the values of $L[y_i(x)]$ are set at a fixed set of $N$ points,
$n < N$, and for $f$ the values of $f(x) - L[y_0(x)]$ are set at these same
points. The $d_i$ are then the $a_i$ required. For example, let there be
given a second-order linear differential equation $y'' + g(x)y' + h(x)y = f(x)$
with the boundary conditions $y(a) = c$ and $y(b) = d$. We choose
$y_0(x)$ so that $y_0(a) = c$ and $y_0(b) = d$ and choose $y_j(x)$, $j > 0$, so that
$y_j(a) = y_j(b) = 0$. Then $f_j$ is the vector whose components are

$$y_j''(x_i) + g(x_i)\,y_j'(x_i) + h(x_i)\,y_j(x_i) \qquad (i = 1, 2, \ldots, N).$$

As a second example, let us take the linear integral equation of the
second kind

$$y(x) - \lambda \int_a^b K(x,t)\,y(t)\,dt = f(x).$$

We set

$$y^*(x) = \sum_{k=1}^{n} a_k y_k(x),$$

where $y_k(x)$ are conveniently chosen functions. Then the components
of $f_j$ should be

$$y_j(x_i) - \lambda \int_a^b K(x_i,t)\,y_j(t)\,dt \qquad (i = 1, 2, \ldots, N).$$

Finally, we mention the method of upper and lower bounds for
estimating quadratic functionals (such as capacity and torsional
rigidity) where an orthonormalization process occurs at an intermediate
level. For a description of these matters, the reader is referred to
Diaz [15].
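
The second-order differential-equation example of this section can be sketched for a concrete case. The problem chosen here is hypothetical: $y'' + y = 0$ with $y(0) = 0$, $y(1) = 1$, whose solution is $\sin x/\sin 1$. The sketch assumes $y_0(x) = x$ and $y_j(x) = x^j(1 - x)$, equal weights at equally spaced points, and NumPy's `lstsq` in place of the orthonormalizing code; none of these choices comes from the text.

```python
import numpy as np
from numpy.polynomial import Polynomial

def least_square_bvp(n=5, N=30):
    """Least-square solution of y'' + y = 0, y(0) = 0, y(1) = 1 (hypothetical example)."""
    x = np.linspace(0.0, 1.0, N)

    def L(p):
        # the differential operator L[y] = y'' + y applied to a polynomial
        return p.deriv(2) + p

    y0 = Polynomial([0.0, 1.0])                    # y0(x) = x meets both conditions
    # y_j(x) = x^j - x^(j+1) = x^j (1 - x) vanishes at x = 0 and x = 1
    basis = [Polynomial([0.0] * j + [1.0, -1.0]) for j in range(1, n + 1)]

    F = np.column_stack([L(p)(x) for p in basis])  # vectors f_j: values of L[y_j]
    rhs = -L(y0)(x)                                # vector f: f(x) - L[y0], with f(x) = 0
    a, *_ = np.linalg.lstsq(F, rhs, rcond=None)    # the d_j, i.e. the required a_j

    y_star = y0
    for coeff, p in zip(a, basis):
        y_star = y_star + p * float(coeff)
    return y_star

y_star = least_square_bvp()
error_at_midpoint = abs(y_star(0.5) - np.sin(0.5) / np.sin(1.0))
```

Note that the equation is only satisfied approximately, while the boundary conditions hold exactly by construction, which is the reverse of the situation in Sec. 10.14.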


10.18 Experimental Computations Using Orthonormalizing 
Codes 


We now give the details of a number of problems, principally in 
potential theory and conformal mapping, that have been handled by
the methods just outlined. The computations can be described briefly
as follows: 

1. Solution of a Dirichlet problem for a “bean-shaped” region. 

2. Computation of the system of orthonormal polynomials for the 
bean-shaped region. 

3. Computation of the system of orthonormal polynomials for a square. 

4. Solution of a Dirichlet problem for an irregular pentagon and 
utilization of Nehari’s error estimate. 

5. Computation of the transfinite diameter of two collinear line 
segments. 


10.19 Dirichlet Problem for a Bean-shaped Region (Code I) 


For complete details, see Davis and Rabinowitz [13]. A bean- 
shaped region (see Fig. 10.3) was obtained from a freehand drawing on 


FIG. 10.3  Bean-shaped region used in computation.


coordinate paper. The region itself is “defined” by means of 43 points
on the contour (see Table 10.1). These points are not distributed
equally on the boundary; somewhat more points appear where the
curvature is greatest.

Although certain theoretical difficulties occur when nonconvex
regions are employed, we were interested in testing the process for a
fairly intricate region. Since the region was not specified analytically,
no attempt was made to incorporate into the weights $w_i$ [see (10.21)] a
very exact line element $ds$ or a very exact rule of numerical integration.
For this region, weights $w_i$ were taken proportional to the distance
between the successive points given on the contour. These are listed
in column 4 of Table 10.1.
TABLE 10.1

Boundary value = $e^x \cos y + \log\,[(1 - y)^2 + x^2]$

  Pt. no.    Weight    Boundary value
     1       .01414       .76089
     2       .01427       .72025
     3       .01963       .66721
     4       .02300       .95236
     5       .03897       .40068
     6       .02792       .17014
     7       .03324       .06949
     8       .01483       .02006
     9       .01423       .04590
    10       .01505       .12037
    11       .01483       .22850
    12       .01420       .33248
    13       .02881       .41180
    14       .03043       .96168
    15       .03076       .70060
    16       .03311       .84177
    17       .03175       .98915
    18       .01809      1.12198
    19       .01998      1.19326
    20       .01882      1.26792
    21       .03140      1.33734
    22       .03450      1.44504
    23       .02846      1.54875
    24       .02831      1.64168
    25       .03860      1.73582
    26       .02431      1.85882
    27       .02059      1.91643
    28       .03566      1.95563
    29       .03122      1.99206
    30       .02975      1.96611
    31       .02846      1.89648
    32       .01696      1.75623
    33       .02330      1.63625
    34       .02102      1.41224
    35       .01795      1.14765
    36       .01147       .91523
    37       .01762       .78912
    38       .01648       .68785
    39       .01901       .66231
    40       .01901       .69067
    41       .01809       .72694
    42       .01677       .75642
    43       .01501       .77202

Discrepancy:

  -.0030   -.0031   -.0032   -.0034   -.0032
  -.00006   .0044    .0069    .0023   -.0042
  -.0069   -.0026    .0014    .0038    .0009
  -.0019   -.0023    .0001    .0018    .0029
   .0027    .0000   -.0025   -.0017    .0015
   .0037    .0014   -.0013   -.0030    .0004

Least-square harmonic polynomial:

    1.0017261087
  +  .997339446    Re (z)
  - 1.991187716    Im (z)
  + 1.48065453     Re (z^2)
  -  .00949996     Im (z^2)
  +  .1889575      Re (z^3)
  +  .6236775      Im (z^3)
  -  .355600       Re (z^4)
  +  .024526       Im (z^4)
  -  .11960        Re (z^5)
  -  .28034        Im (z^5)

Discrepancies at points interior to bean (columns: x, y = 0; Discrepancy)

As boundary-value data, we used the values of the harmonic function

$$u(x,y) = \operatorname{Re}\,[e^z + \log\,(z - i)^2] \tag{10.62}$$

at the 43 points on the boundary. These are listed in column 5 of
Table 10.1. These boundary data were approximated by linear
combinations of the 11 harmonic functions $1, \operatorname{Re}(z), \ldots, \operatorname{Re}(z^5),$
$\operatorname{Im}(z^5)$.

The input data for this problem were, accordingly, $w_j$ = weights of
column 4, Table 10.1,

$$y_{kj} = \begin{Bmatrix}\operatorname{Re}\\ \operatorname{Im}\end{Bmatrix}(x_j + iy_j)^k, \qquad y_j = \operatorname{Re}\,[e^{x_j + iy_j} + \log\,(x_j + iy_j - i)^2]. \tag{10.63}$$


Column 6, Table 10.1, lists the discrepancy between the specified 
values and the computed (least-square) values along the contour. It 
will be seen that the highest deviation is .0069. If one knew that this 
was the greatest deviation over the whole contour, then the maximum
principle for harmonic functions would indicate that this is also the
greatest deviation in the interior. Unfortunately, it is impossible
theoretically to make such a conclusion (for a theoretical discussion of 
this point, see Payne and Weinberger [27] and Nehari [24, 26]), but one 
feels that in the interior these deviations are also of the same order of 
magnitude. We have computed the deviations at 10 points along the 
real axis in the interior of the region and have listed them in Table 10.1. 
These results bear out this feeling. 


10.20 Orthogonal Polynomials for a Bean-shaped Region 
(Code I) 


The input data here were as follows: $w_j$ = weights of column 4,
Table 10.1, $y_{kj} = (x_j + iy_j)^k$, $y$ = arbitrary. As part of the output
data, we obtained the coefficients of the orthonormal polynomials and
the values of each orthonormal polynomial at each of the 43 points on
the contour. We obtained the orthonormal polynomials up to and
including those of degree 21. For reasons which are explained presently,
it is not felt that the polynomials of degree greater than 11 are of
great significance numerically.

Table 10.2 presents the ratios $k_n/k_{n+1}$ for $n = 0, \ldots, 10$. According
to (10.56) these ratios approach the transfinite diameter of the region.
The convergence of this sequence is not too rapid, but the table suggests
that we have determined this constant to 2 decimal places. We have
computed these ratios also for $n = 11, \ldots, 20$, but have not tabulated
them here. Their behavior is steady for a while and then, as $n > 11$,
they begin to increase rapidly toward 1. This is the result of two things.

TABLE 10.2  DETERMINATION OF TRANSFINITE DIAMETER
OF BEAN-SHAPED REGION

   n     k_n/k_{n+1}

   0      .485511
   1      .913294
   2      .903448
   3      .903615
   4      .906216
   5      .906834
   6      .908043
   7      .907085
   8      .907510
   9      .505941
  10      .908073



In the first place, there is a considerable loss of significance in the
coefficients of high order, since these values have to be scaled down
sufficiently to fit on the machine. Secondly, because only crude
integration rules were employed in computing $\int_C z^m \bar z^n\,ds$, the orthonormal
polynomials themselves tend more to those corresponding to
a finite-sum inner product as $n$ approaches the number of points on the
contour.

According to (10.55), the ratio $p_{n+1}(z)/p_n(z)$ tends to the exterior
mapping function. We have tested this out for $n = 10$. The worst
agreement can be expected on the boundary of the region, where a
theoretical value of $|\phi(z)| = 1$, $z \in C$, should be obtained. Table 10.3
lists the values of $|p_{11}(z)/p_{10}(z)|$ on the contour $C$. A maximum error
of 10 per cent from the theoretical value of 1 was obtained. The
average error appears to be about 5 per cent. From the values of
$p_{10}(z)$ on the contour it was a fairly simple matter to trace the variation
in $\arg p_{10}(z)$, $z \in C$, and to verify that all the zeros of $p_{10}(z)$ lie in $B$.
Thus, $p_{11}/p_{10}$ is regular outside of $B$.

As might have been foreseen from the behavior of the ratios $k_{n+1}/k_n$
for $n > 11$, no improvement in the quantities

$$|p_{n+1}(z)/p_n(z)|$$

was observed for $n > 11$.


10.21 Orthogonal Polynomials for a Square (Code I) 


For machine purposes, it was convenient to have all distances from
the boundary to the origin $< 1$, and so the side of the square was taken
to be $a = 1.4$. Since the boundary of the square consists of elementary
curves, it is not too difficult to employ high accuracy integration formulas
in (10.21). In computing with the square, we selected along
TABLE 10.3  THE QUANTITY |p_11/p_10| EVALUATED ON THE
BOUNDARY OF THE BEAN

  Pt. no.   |p_11/p_10|

     1         .937
     2        1.018
     3        1.016
     4         .968
     5        1.069
     6         .985
     7         .996
     8        1.056
     9         .940
    10        1.050
    11        1.026
    12         .976
    13        1.034
    14         .984
    15         .981
    16        1.029
    17        1.057
    18        1.033
    19         .981
    20         .928
    21         .890



each of the sides a 16-point Gaussian integration formula. Inasmuch 
as the function z* = (x + 1y)* is, along either x = const or y = const, 
a polynomial in_» or in x of degree k, this Gaussian integration formula 
will produce inner products which are completely accurate, neglecting 


machine round-off, up to the terms | zz!5 ds. No particular use of 
is 


the symmetries of the square was made, and the cyclic occurrence of 
many zero coefficients in the orthonormal polynomials served as a
running check on the accuracy of the process at the machine end of the 
job. The orthonormal polynomials are listed in Table 10.4. Table 
10.5 lists the ratios $k_n/k_{n+1}$, which approach the transfinite diameter of
the square. The theoretical value for this quantity (see Pólya and
Szegő [28], p. 252) is

$$\tau = 1.4\,\frac{[\Gamma(1/4)]^2}{4\pi^{3/2}} = .826238.$$

Thus, using orthonormal polynomials of degree 15, we have secured 
this quantity to 3 significant figures. 
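
The comparison just described can be repeated in a few lines. The sketch below is only an illustration, not the original Code I: it evaluates the closed form $1.4\,[\Gamma(1/4)]^2/(4\pi^{3/2})$ with the standard library and measures how far each tabulated ratio of Table 10.5 lies from it. The names `SIDE`, `tau_theory`, `ratios`, and `abs_errors` are assumptions introduced here.

```python
import math

SIDE = 1.4  # side of the square used in the text

# Closed form for the transfinite diameter of a square of side SIDE:
# SIDE * Gamma(1/4)^2 / (4 * pi^(3/2)).
tau_theory = SIDE * math.gamma(0.25) ** 2 / (4.0 * math.pi ** 1.5)

# Ratios k_n / k_{n+1}, n = 0, ..., 14, copied from Table 10.5.
ratios = [
    0.808290377, 0.828251169, 0.848528137, 0.799258916, 0.827753357,
    0.826173559, 0.824643305, 0.824328492, 0.825728043, 0.825659214,
    0.825719560, 0.825599896, 0.825888555, 0.825918846, 0.825950067,
]


def abs_errors(values, target):
    """Absolute deviation of each tabulated ratio from the target value."""
    return [abs(v - target) for v in values]


errors = abs_errors(ratios, tau_theory)

if __name__ == "__main__":
    print(f"theoretical value: {tau_theory:.6f}")
    for n, (r, e) in enumerate(zip(ratios, errors)):
        print(f"n = {n:2d}  ratio = {r:.9f}  |error| = {e:.2e}")
```

The last ratio differs from the closed form by less than $5 \times 10^{-4}$, which is what agreement to 3 significant figures requires.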
TABLE 10.4 ORTHONORMAL POLYNOMIALS FOR A SQUARE;
SIDE = 1.4; SUM OF GAUSSIAN WEIGHTS = 1.0


 1.0000000000
 1.2371791483 Z
 1.4937246015 Z^2
 1.7603713248 Z^3
 2.2025044571 Z^4   + 0.4230570561
 2.6608221383 Z^5   + 0.7301295947 Z
 3.220657584  Z^6   + 1.1572896252 Z^2
 3.905515952  Z^7   + 1.7238789614 Z^3
 4.737815070  Z^8   + 2.4298575834 Z^4   -  .0188295964
 5.737742726  Z^9   + 3.3915415979 Z^5   +  .0646244543 Z
 6.949286858  Z^10  + 4.645751758  Z^6   +  .2286090195 Z^2
 8.416037589  Z^11  + 6.274558086  Z^7   +  .5137415655 Z^3
10.19384526   Z^12  + 8.381528538  Z^8   +  .9916706683 Z^4  + .0261223727
12.34288234   Z^13  + 11.102777658 Z^9   + 1.7111406242 Z^5  + .0433444223 Z
14.94442510   Z^14  + 14.597987726 Z^10  + 2.782242918  Z^6  + .0869122976 Z^2
18.09361812   Z^15  + 19.073026420 Z^11  + 4.339843064  Z^7  + .1803740299 Z^3


TABLE 10.5 DETERMINATION OF TRANSFINITE DIAMETER
OF SQUARE; SIDE = 1.4


 n     $k_n/k_{n+1}$

 0     .808290377
 1     .828251169
 2     .848528137
 3     .799258916
 4     .827753357
 5     .826173559
 6     .824643305
 7     .824328492
 8     .825728043
 9     .825659214
10     .825719560
11     .825599896
12     .825888555
13     .825918846
14     .825950067

10.22 Boundary-value Problems for an Irregular Pentagon (Code III)


For complete details, see Hochstrasser [22]. An irregular pentagon,
inscribable in a circle, was selected (see Fig. 10.4), and the following
boundary-value problems were set up:

1. Boundary values given by $u(x,y) = \mathrm{Re}\,(\cos z)$, $z = x + iy$.

2. Boundary values given by $u(x,y) = \mathrm{Re}\,[1/(z + 4)]$, $z = x + iy$.

3. Boundary values equal 1 on one fixed side and 0 on all other sides
(harmonic measures).

Fig. 10.4 Irregular pentagon used in computation.
Fig. 10.5


We report here only on problem 3. The harmonic measures of the
sides of a polygon are intimately related to the constants of the
Schwarz-Christoffel formula (see Ahlfors [3], p. 45).

In view of the discontinuity of the prescribed boundary values at two
successive vertices of the pentagon, $(x_1, y_1)$ and $(x_2, y_2)$, it seems best to
proceed by subtracting out the discontinuity. This can be accomplished
by finding a harmonic function which possesses the same jumps at
$(x_1, y_1)$ and $(x_2, y_2)$ and using it as the first approximation. Such a
function is given by

$$u_0 = \frac{\phi - \theta}{\pi - 2\theta}, \tag{10.64}$$

where $\phi$ is the angle subtended at the point $(x,y)$ by the fixed points
$(x_1, y_1)$ and $(x_2, y_2)$ (see Fig. 10.5). This function $u_0$ is the harmonic
measure of the arc of the circle joining $(x_1, y_1)$ and $(x_2, y_2)$, which means
that it takes on the value 1 on this arc and the value 0 on the
complementary arc. The approximation to the given values on the pentagon
was accomplished by using the functions $u_0$, 1, $\mathrm{Re}\,(z)$, $\mathrm{Im}\,(z)$, $\mathrm{Re}\,(z^2)$,
$\mathrm{Im}\,(z^2)$, ..., $z = x + iy$. A total of 12 harmonic functions was
employed. For the evaluation of the inner products, the integrals
along the sides of the pentagon were evaluated by a Radau rule. This
is an 8-point modified Gaussian quadrature formula which involves
the end points of the intervals and is exact for polynomials of degree 15
or less.

Table 10.6 presents for the harmonic measures of sides I to V the
Nehari estimates (Sec. 10.15) at the point $x = y = 0$, as well as the
maximum discrepancy between the prescribed boundary values and
those given by the least-square solution.

TABLE 10.6 ERRORS FOR HARMONIC MEASURES OF SIDES OF PENTAGON

Side                               I        II       III      IV       V

Nehari estimate at x = y = 0     .0146    .00556   .0239    .00945   .0201

Maximum error on boundary       -.0520   -.0279   -.0945    .0335    .0788

For the boundary-value
problems 1 and 2, the exact difference between the theoretical value
and the computed value can be computed, and it was found that the
predicted error was in excess of the observed error by factors which ran
from 4 to 100. In the case of the harmonic measures, the exact values
of the solutions are not known, and the predicted error must, therefore,
be viewed in the light of this remark.

The values show, however, that the predicted error bound is better
at $x = y = 0$ than the one suggested by the maximum error on the
boundary.


10.23 Numerical Computation of the Transfinite Diameter 
of Two Collinear Line Segments (Code II) 


For complete details, see Davis [14]. The transfinite diameter is
known explicitly for a number of elementary geometrical configurations,
but, in general, its numerical evaluation is attended by considerable
difficulty. For two collinear line segments of equal length placed
symmetrically with respect to 0, say $E$: $-1 \le x \le -a$, $a \le x \le 1$,
$0 < a < 1$, the value of $\tau(E)$ is known theoretically and is simply

$$\tau(E) = \tfrac12\sqrt{1 - a^2}. \tag{10.65}$$

For two collinear line segments of unequal length, the value of $\tau(E)$ has
been obtained by Achieser [1] and can be expressed as the ratio of
certain elliptic functions. For more than two line segments, a
closed-form value of $\tau(E)$ is not known to the author.

The relationship (10.60) was tested to see what could be achieved by
way of accuracy. In these computations, the value $a = 1/2$ was
selected, leading, in (10.65), to

$$\tau = \tfrac14\sqrt{3} = .4330127. \tag{10.66}$$

The inner products were computed by means of a 10-point Gaussian
quadrature rule on each of the two segments $(-1, -\tfrac12)$ and $(\tfrac12, 1)$.
The machine was programmed to print out the coefficients of the
orthonormal polynomials, as well as the values of these polynomials at the
Gaussian abscissas employed. In this way, it was possible to monitor
the obvious global properties of the orthonormal polynomials, as well
as to see where the accumulated round-off began to vitiate the
computations. One way in which this was done was as follows. The
polynomials $p_n(x)$, orthonormal over the set $E$: $(-1, -\tfrac12)$, $(\tfrac12, 1)$,
are alternately even and odd. As $n$ increases, the theoretical zero
values for the alternate coefficients of $p_n$ become contaminated by
round-off, and these “zeros” begin to assume the proportions of the
nonzero coefficients and of the values of the orthonormal polynomials
themselves (see Table 10.7). It was found that all significance was


TABLE 10.7 ORTHONORMAL POLYNOMIALS ON $(-1, -\tfrac12)$, $(\tfrac12, 1)$ EXHIBITING
EFFECT OF ROUND-OFF


Coefficient 
of: I x x? x3 xt 
Po 1.0000000 
pi -00041000000 1 1.3093073 
be — 2.6843775 — 00000000003 4.601 7900 
D3 — 0000000002 —4.1140778 .0000000002 6.1932355 
pa 7.6721535 — 0000000004 —28.177018 -OB000000004 23.822712 
Ps QOOQQ0007 11.694401 — .00000003 — 40.918493 ae ants 
Ps — 22.558405 — .QO0U0005 131.20021 .00U00017 — 230.407 
; — 0000083 — 34.093102 -OOOO48 189.08749 — (hae sy 
Ps 66.959357 .000078 _ — §24.56943 — .OOO43 143§.1255 
Py 00065 100.83287 — .005 — 763.6782 Us 
Pro — 199.48833 — 0021 1963.8362 015 — 7312.7657 
Coefficient 
of: x5 x8 x? x8 x? glo 
Po 
pi 
Pe 
Pa 
Mm 
Ps 31.868611 
Ps — —.0000014 125.35317 
7 — 319,27433 000046 167.38838 
Pr .0007 1 — 1637.5651 — 00037 663.46546 
Po 2022.6110 —.015 — 2241.6237 .0059 885.36970 
fio — 039 12922.714 .041 — 1089 1.666 — U157 3521.15 97 


lost when an attempt was made to go beyond $n = 10$. The last value
of $(1/a_n)^{1/n}$ gave the value of $\tau(E)$ correct to within .009. Additional
accuracy is obtainable only by employing double-precision coding
and going beyond $n = 10$.

Although closed-form expressions for the orthonormal polynomials
over $E$: $(-1, -\tfrac12)$, $(\tfrac12, 1)$ are not available, such expressions are
available for the Chebyshev polynomials for $E$ (see [2], p. 287). Here
the even and odd polynomials have a totally different structure.
Using the latter polynomials as a guide (in the theory of domain
polynomials these two sets frequently behave alike), we can confirm the
slightly higher values for $(1/a_n)^{1/n}$ registered in Table 10.8 for odd $n$.
TABLE 10.8 COMPUTATION OF TRANSFINITE DIAMETER FROM
LEADING COEFFICIENTS OF ORTHONORMAL POLYNOMIALS


 n      $a_n$         $(1/a_n)^{1/n}$    $(a_n/a_{n+2})^{1/2}$
 1      1.3093073     .76376          .45979
 2      4.6017900     .46616          .43951
 3      6.1932355     .54454          .44084
 4     23.822212      .45264          .43594
 5     31.868611      .50041          .43633
 6    125.35317       .44700          .43467
 7    167.38858       .48120          .43481
 8    663.46546       .44389          .43408
 9    885.36970       .47048
10   3521.1597        .44191

Theoretical value: .4330127


We can also conclude from the form of the Chebyshev polynomials that
the ratio $(a_n/a_{n+2})^{1/2}$ would be a good estimator for $\tau(E)$. Table 10.8
also presents these values, and it will be seen that the last entry,
$(a_8/a_{10})^{1/2}$, yields $\tau(E)$ correctly to within about .001.
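
The two estimators can be recomputed directly from the tabulated leading coefficients. The sketch below is an illustration only, not the original Code II; it takes the values of $a_n$ as printed in Table 10.8 and uses the hypothetical helper names `root_estimate` and `ratio_estimate`.

```python
import math

HALF_GAP = 0.5  # the parameter a in E: [-1, -a] U [a, 1]

# Closed form (10.65): tau(E) = (1/2) * sqrt(1 - a^2).
tau_exact = 0.5 * math.sqrt(1.0 - HALF_GAP ** 2)

# Leading coefficients a_1, ..., a_10 as printed in Table 10.8.
a = {
    1: 1.3093073, 2: 4.6017900, 3: 6.1932355, 4: 23.822212,
    5: 31.868611, 6: 125.35317, 7: 167.38858, 8: 663.46546,
    9: 885.36970, 10: 3521.1597,
}


def root_estimate(n):
    """Estimator (1/a_n)^(1/n) for the transfinite diameter."""
    return a[n] ** (-1.0 / n)


def ratio_estimate(n):
    """Estimator sqrt(a_n / a_{n+2}); defined for n = 1, ..., 8."""
    return math.sqrt(a[n] / a[n + 2])


if __name__ == "__main__":
    print(f"exact value: {tau_exact:.7f}")
    for n in range(1, 11):
        line = f"n = {n:2d}  root = {root_estimate(n):.5f}"
        if n + 2 in a:
            line += f"  ratio = {ratio_estimate(n):.5f}"
        print(line)
```

The root estimator oscillates between even and odd $n$, as the text notes, while the ratio estimator compares coefficients of the same parity and so settles much faster.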
GENERAL INTRODUCTION


Finite-difference methods afford a powerful tool for the solution of 
many problems involving partial differential equations. The domain 
of the independent variables is replaced by a finite set of points, usually 
referred to as mesh points, and one seeks to determine approximate 
values for the desired solution at these points. The values at the mesh 
points are required to satisfy difference equations obtained either by 
replacing partial derivatives by partial difference quotients or by certain 
other more sophisticated techniques. 

In practice, the decision regarding how many mesh points should be 
used to achieve a desired accuracy is usually based on intuition and 
experience, since practical error estimates are not available. Some- 
times approximate solutions are obtained with two different sets of mesh 
points, one a refinement of the other, and the results compared, it being 
assumed that the accuracy of either of the two approximate solutions is 
about the same as the difference between them. The main problem 
from the computational standpoint, however, and the one primarily 
considered here, is that of actually computing the solution of the
difference equation. For an elliptic equation the problem is that of solving
a system of linear algebraic equations

$$Au + d = 0, \tag{11.1}$$

whereas for parabolic equations one has to solve a system of ordinary
differential equations

$$\frac{du}{dt} = Au + d. \tag{11.2}$$

Here $A$ is a given square matrix, $d$ is a given column matrix, and $u$ is an
unknown column matrix.

It is not difficult to show that the matrix A in (11.1) is nonsingular 
and hence that a unique solution exists. However, since there is one 
equation for each mesh point and since there may be several thousand 
mesh points, great care must be taken in the choice of the method for 
solving (11.1), lest the computing time, even for a very fast computing 
machine, becomes excessive. Iterative methods are indicated because 
of the large number of zero elements of A and are nearly always used. 
With such methods one guesses an initial approximation to the solution 
of (11.1) and successively modifies this approximate solution, according 
to a given rule, until convergence has been achieved to within a pre- 
scribed tolerance. The primary purpose of the first part of this chapter, 
Secs. 11.1 to 11.13, is to describe a number of iterative methods and to 
give results on the rates of convergence of these methods. Methods for 
setting up finite-difference analogues for problems involving elliptic 
equations are also discussed in this part. 

Depending on what numerical procedure is used to solve (11.2), it may 
or may not be necessary to solve systems of linear algebraic equations. 
With “explicit” methods no such systems are involved. However, 
in such cases the increment in $t$ is limited by stability considerations.
Such limitations can be partially or wholly removed by use of
“implicit” methods, which may, however, involve solving systems of 
equations. The second part, Secs. 11.14 to 11.18, describes explicit and 
implicit methods for solving parabolic equations with one or two space 
variables in addition to the time variable. Also contained in this part 
is a discussion of the relation of some of these methods to certain iterative 
methods for solving problems involving elliptic equations. 


ELLIPTIC EQUATIONS 
11.1 Introduction 


As already noted, the rate of convergence of the iterative method used 
is of critical importance in the solution of a boundary-value problem 
involving a linear elliptic partial differential equation. Following
introductory discussions in Secs. 11.2 to 11.4 on the setting up of
finite-difference analogues of such problems and on the accuracy of the
numerical solutions, we present in Secs. 11.5 to 11.7 discussions of
several iterative methods. Point iterative methods such as the
Gauss-Seidel method and the successive-overrelaxation method are considered
as well as line relaxation methods and an alternating-direction implicit
method. The various methods are described, and rules and suggestions
are given for their use and for estimating the convergence rates.
Theoretical discussions are postponed until Secs. 11.8 to 11.12. Here,
discussions of the theory on which some of the methods are based are given,
together with some proofs. Certain generalizations and extensions are
given, particularly for the successive-overrelaxation method. The
reader who is interested in the use of the methods but not in the
underlying theory can omit these later sections. Applications and numerical
experiences are described in Sec. 11.13.


11.2 Boundary-value Problems 


We restrict our attention here to the following class of problems. Let
$R$ be a bounded plane region with boundary $S$ and let $f$ be a function
defined and continuous on $S$. Find a function $U(x, y)$ which is
continuous on $R + S$, is twice differentiable in $R$, satisfies in $R$ the linear
second-order partial differential equation

$$A U_{xx} + C U_{yy} + D U_x + E U_y + F U = G, \tag{11.3}$$

and satisfies on $S$ the condition

$$U = f. \tag{11.4}$$

Here $A$, $C$, $D$, $E$, $F$, and $G$ are analytic functions of the independent
variables $x$ and $y$ in $R$ and satisfy in $R + S$ the conditions $A > 0$, $C > 0$,
$F \le 0$. Because of the conditions on $A$ and $C$, Eq. (11.3) is said to be
elliptic. More generally, an equation of the form

$$A U_{xx} + 2B U_{xy} + C U_{yy} = D(x, y, U, U_x, U_y) \tag{11.5}$$

is said to be elliptic in $R$ if $B^2 - AC < 0$ in $R$. We remark that, by
introducing new variables $\xi$ and $\eta$ satisfying the conditions


i i nm _ [A r 
ENG’ 7. NG? (11.6 

we obtain an equation of the form

$$\alpha U_{\xi\xi} + \gamma U_{\eta\eta} = \delta(\xi, \eta, U, U_\xi, U_\eta), \tag{11.7}$$

where $\alpha > 0$, $\gamma > 0$ in $R'$, the region in the $(\xi, \eta)$ plane corresponding
to $R$. Equation (11.7) is similar to (11.3) as far as the terms involving
the second-order derivatives are concerned. Much of our subsequent
discussion applies to (11.7) as well as to (11.3). Much of our later
discussion also applies to cases where, instead of prescribing $U$ on $S$, we
prescribe the normal derivative $\partial U/\partial n$ or a linear combination of $U$ and
$\partial U/\partial n$.

11.3 Construction of the Difference-equation Analogue 


Since the problems formulated in Sec. 11.2 cannot be solved analyti- 
cally except in a very few special cases, it is usually necessary to resort to 
approximate numerical methods. Of these, the method of finite differ- 
ences seems best adapted for the solution of most problems on high- 
speed computers. 

Let $(\bar x, \bar y)$ denote an arbitrary but fixed point in the $(x, y)$ plane and
let $h$ be a fixed positive number, which we call the mesh size. Let $\Omega_h$
denote the set of points $(x, y)$ such that both $h^{-1}(x - \bar x)$ and $h^{-1}(y - \bar y)$
are integers. Two points $(x, y)$ and $(x', y')$ of $\Omega_h$ are adjacent if
$(x - x')^2 + (y - y')^2 = h^2$. We let $R_h$ denote the set of all points belonging
to both $\Omega_h$ and $R$. Points of $R_h$ are called interior net points. We determine
the set $S_h$ of boundary net points as follows. For each point $(x, y)$ of $R_h$,
consider the 4 adjacent points of $\Omega_h$, which we denote by $(x_i, y_i)$, $i = 1, 2, 3, 4$.
For each $i$, consider the open line segment $l_i$ joining $(x, y)$ to $(x_i, y_i)$.
Three cases can occur:

1. If $l_i$ does not intersect $S$ and if $(x_i, y_i)$ is a point of $R_h$, then no point
of $S_h$ belongs to $l_i$.

2. If $l_i$ does not intersect $S$ and if $(x_i, y_i)$ is a point of $S$, then $(x_i, y_i)$ is a
point of $S_h$.

3. If $l_i$ intersects $S$, then we let the point of intersection nearest to
$(x, y)$ belong to $S_h$.

The set $S_h$ consists of all points found by considering each point of $R_h$ and
the corresponding four line segments $l_i$.

As an example, consider the ellipse $(x/3)^2 + (y/2)^2 = 1$, with $\bar x = \bar y = 0$
and with $h = 1$ (Fig. 11.1). Points of $R_h$ and $S_h$ are indicated by
solid black circles and open circles, respectively. Points of $\Omega_h$ not
belonging to $R_h$ or $S_h$ are exterior points and are indicated by open
squares.
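
The construction of the interior and boundary net points can be traced for the ellipse example. The following sketch is an illustration under stated assumptions: it is restricted to an ellipse centered at the net origin (a convex region, so each segment from an interior point to an exterior neighbor meets the boundary exactly once), and the function name `ellipse_net` and its parameters are hypothetical.

```python
import math

TOL = 1e-12  # tolerance for deciding that a net point lies exactly on S


def ellipse_net(h=1.0, a=3.0, b=2.0):
    """Interior and boundary net points for (x/a)^2 + (y/b)^2 = 1, net origin (0, 0)."""

    def level(x, y):
        return (x / a) ** 2 + (y / b) ** 2

    n_x, n_y = int(a / h) + 1, int(b / h) + 1
    interior = [
        (i * h, j * h)
        for i in range(-n_x, n_x + 1)
        for j in range(-n_y, n_y + 1)
        if level(i * h, j * h) < 1.0 - TOL
    ]

    boundary = set()
    for x, y in interior:
        for dx, dy in ((h, 0.0), (0.0, h), (-h, 0.0), (0.0, -h)):
            xn, yn = x + dx, y + dy
            lev = level(xn, yn)
            if lev < 1.0 - TOL:
                continue                      # case 1: neighbor is interior
            if abs(lev - 1.0) <= TOL:
                point = (xn, yn)              # case 2: neighbor lies on S
            elif dx != 0.0:                   # case 3: horizontal segment crosses S
                point = (math.copysign(a * math.sqrt(1.0 - (y / b) ** 2), dx), y)
            else:                             # case 3: vertical segment crosses S
                point = (x, math.copysign(b * math.sqrt(1.0 - (x / a) ** 2), dy))
            boundary.add((round(point[0], 12), round(point[1], 12)))
    return interior, sorted(boundary)


interior_points, boundary_points = ellipse_net()

if __name__ == "__main__":
    print(len(interior_points), "interior net points")
    print(len(boundary_points), "boundary net points")
```

For a nonconvex region the third case would need a genuine search for the nearest intersection along each segment; the closed-form crossing used here is a convenience of the ellipse.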

The simplest and most direct method of deriving a difference-equation
representation of (11.3) is to replace the partial derivatives by suitable
difference quotients involving the values of $U$ at points of $R_h$ and/or $S_h$.
If this is done for each point of $R_h$, one then obtains a system of $N$ linear
algebraic equations involving the $N$ unknown values of $U$ at points of $R_h$.
We recall that the values of $U$ at the boundary points of $S_h$ are given.

To illustrate the derivation of the difference equations, let us first
consider a point $(x_0, y_0)$ of $R_h$ such that the 4 adjacent points in $\Omega_h$ also
belong either to $R_h$ or to $S_h$. Such a point is known as a regular point.
We seek to determine a difference expression corresponding to (11.3)
which involves values of $U$ corresponding to the 5 points shown in
Fig. 11.2. Clearly more accurate difference expressions could be
found if one were to consider additional points, but at the sacrifice of
simplicity.

Fig. 11.2 The regular point $(x_0, y_0)$ and its four adjacent points $(x_1, y_1)$ (right), $(x_2, y_2)$ (above), $(x_3, y_3)$ (left), and $(x_4, y_4)$ (below), each at distance $h$.

We first obtain difference quotients corresponding to $U_{xx}$. These
difference quotients involve $U_0$, $U_1$, $U_3$, where we let $U_i = U(x_i, y_i)$,
$i = 0, 1, 2, 3, 4$. Similarly, we let $(U_x)_i = U_x(x_i, y_i)$, etc. Assuming
that $U(x, y)$ has continuous partial derivatives of fourth order in a
sufficiently large neighborhood about $(x_0, y_0)$, we have, by Taylor’s theorem,

$$U_1 = U_0 + (U_x)_0 h + (U_{xx})_0 \frac{h^2}{2} + (U_{xxx})_0 \frac{h^3}{3!} + (U_{xxxx})_{01} \frac{h^4}{4!}, \tag{11.8}$$

where $(U_{xxxx})_{01} = U_{xxxx}(\xi_1, y_0)$ and where $x_0 \le \xi_1 \le x_1$.
Similarly,

$$U_3 = U_0 - (U_x)_0 h + (U_{xx})_0 \frac{h^2}{2} - (U_{xxx})_0 \frac{h^3}{3!} + (U_{xxxx})_{03} \frac{h^4}{4!}. \tag{11.9}$$

Here and below, a subscript $0i$ on a derivative denotes evaluation at some
point of the segment joining $(x_0, y_0)$ to $(x_i, y_i)$; the point in general
differs from one formula to the next. Combining (11.8) and (11.9), we have

$$(U_{xx})_0 = \frac{U_1 + U_3 - 2U_0}{h^2} - \frac{h^2}{4!}\left[(U_{xxxx})_{01} + (U_{xxxx})_{03}\right]. \tag{11.10}$$

Using the same method, we can also obtain the following:

$$(U_{yy})_0 = \frac{U_2 + U_4 - 2U_0}{h^2} - \frac{h^2}{4!}\left[(U_{yyyy})_{02} + (U_{yyyy})_{04}\right], \tag{11.11}$$

$$(U_x)_0 = \frac{U_1 - U_3}{2h} - \frac{h^2}{12}\left[(U_{xxx})_{01} + (U_{xxx})_{03}\right], \tag{11.12}$$

$$(U_y)_0 = \frac{U_2 - U_4}{2h} - \frac{h^2}{12}\left[(U_{yyy})_{02} + (U_{yyy})_{04}\right]. \tag{11.13}$$
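
The remainder terms in (11.10) and (11.12) are proportional to $h^2$, so halving the mesh size should reduce the error of each quotient by a factor close to 4. The sketch below checks this on the test function $\sin x$, which is an arbitrary choice made here for illustration; the helper names are hypothetical.

```python
import math


def second_difference(f, x, h):
    """Quotient (f(x+h) + f(x-h) - 2 f(x)) / h^2, as in (11.10) without remainder."""
    return (f(x + h) + f(x - h) - 2.0 * f(x)) / h ** 2


def central_difference(f, x, h):
    """Quotient (f(x+h) - f(x-h)) / (2h), as in (11.12) without remainder."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 1.0
exact_first = math.cos(x0)
exact_second = -math.sin(x0)

# Errors for mesh sizes h and h/2.
err2_h = abs(second_difference(math.sin, x0, 0.1) - exact_second)
err2_half = abs(second_difference(math.sin, x0, 0.05) - exact_second)
err1_h = abs(central_difference(math.sin, x0, 0.1) - exact_first)
err1_half = abs(central_difference(math.sin, x0, 0.05) - exact_first)

ratio_second = err2_h / err2_half
ratio_first = err1_h / err1_half

if __name__ == "__main__":
    print(f"error ratio for U_xx quotient: {ratio_second:.4f}")
    print(f"error ratio for U_x  quotient: {ratio_first:.4f}")
```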
Neglecting the remainder terms in (11.10) to (11.13) and substituting
in (11.3), we obtain for each regular point $(x, y)$ in $R_h$ the difference
equation

$$(A + \tfrac12 Dh)\,U(x + h, y) + (A - \tfrac12 Dh)\,U(x - h, y) + (C + \tfrac12 Eh)\,U(x, y + h) + (C - \tfrac12 Eh)\,U(x, y - h) - 2(A + C - \tfrac12 Fh^2)\,U(x, y) = Gh^2, \tag{11.14}$$

where $A$, $C$, $D$, $E$, $F$, and $G$ are to be evaluated at $(x, y)$. Writing (11.14)
in the form

$$\alpha_1 U(x + h, y) + \alpha_3 U(x - h, y) + \alpha_2 U(x, y + h) + \alpha_4 U(x, y - h) - \alpha_0 U(x, y) = t(x, y), \tag{11.15}$$

where the $\alpha_i$ ($i = 0, 1, 2, 3, 4$) are functions of $(x, y)$, we observe that,
because $A > 0$, $C > 0$, and $F \le 0$,

$$\alpha_0 \ge \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4. \tag{11.16}$$
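
As a small check on (11.14) to (11.16), the coefficients at a regular point can be formed from the local values of $A$, $C$, $D$, $E$, $F$, $G$. This is a minimal sketch; the function name `regular_point_coefficients` and the sample values are assumptions made for illustration.

```python
def regular_point_coefficients(A, C, D, E, F, G, h):
    """Coefficients of (11.15) at a regular point, read off from (11.14).

    Returns ((alpha0, alpha1, alpha2, alpha3, alpha4), t), where index 1 is
    the neighbor at x + h, 2 at y + h, 3 at x - h, and 4 at y - h.
    """
    alpha1 = A + 0.5 * D * h
    alpha3 = A - 0.5 * D * h
    alpha2 = C + 0.5 * E * h
    alpha4 = C - 0.5 * E * h
    alpha0 = 2.0 * (A + C - 0.5 * F * h ** 2)
    t = G * h ** 2
    return (alpha0, alpha1, alpha2, alpha3, alpha4), t


# Laplace's equation: A = C = 1, D = E = F = G = 0 gives the five-point formula.
laplace_alphas, laplace_t = regular_point_coefficients(1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.1)

# An arbitrary sample with F <= 0, used to illustrate inequality (11.16).
sample_alphas, sample_t = regular_point_coefficients(1.0, 2.0, 1.0, -1.0, -3.0, 5.0, 0.1)

if __name__ == "__main__":
    print(laplace_alphas, laplace_t)
    print(sample_alphas, sample_t)
```

Because $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 2(A + C)$ regardless of $D$ and $E$, the inequality (11.16) reduces to $-Fh^2 \ge 0$.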


Even for irregular points of $R_h$, that is, points of $R_h$ which are not regular
points, we can write the difference equation in the form (11.15). Thus,
if $x_1 - x_0 = s_1 h$, $x_0 - x_3 = s_3 h$, where $0 < s_1 \le 1$, $0 < s_3 \le 1$, we can,
with the aid of Taylor’s theorem, easily derive the following formulas:

$$(U_x)_0 \approx \frac{1}{h}\left[\frac{s_3}{s_1(s_1 + s_3)}\,U_1 - \frac{s_1}{s_3(s_1 + s_3)}\,U_3 - \frac{s_3 - s_1}{s_1 s_3}\,U_0\right], \tag{11.17}$$

$$(U_{xx})_0 \approx \frac{2}{h^2}\left[\frac{U_1}{s_1(s_1 + s_3)} + \frac{U_3}{s_3(s_1 + s_3)} - \frac{U_0}{s_1 s_3}\right]. \tag{11.18}$$

Similar expressions can be derived for $(U_y)_0$ and $(U_{yy})_0$. If the resulting
expressions are substituted into (11.3), one obtains

$$\alpha_1 U(x + s_1 h, y) + \alpha_3 U(x - s_3 h, y) + \alpha_2 U(x, y + s_2 h) + \alpha_4 U(x, y - s_4 h) - \alpha_0 U(x, y) = t(x, y), \tag{11.19}$$

where

$$\alpha_1 = \frac{2A}{s_1(s_1 + s_3)} + \frac{h s_3 D}{s_1(s_1 + s_3)}, \qquad \alpha_3 = \frac{2A}{s_3(s_1 + s_3)} - \frac{h s_1 D}{s_3(s_1 + s_3)},$$

$$\alpha_2 = \frac{2C}{s_2(s_2 + s_4)} + \frac{h s_4 E}{s_2(s_2 + s_4)}, \qquad \alpha_4 = \frac{2C}{s_4(s_2 + s_4)} - \frac{h s_2 E}{s_4(s_2 + s_4)},$$

$$\alpha_0 = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - Fh^2, \qquad t = Gh^2. \tag{11.20}$$

The proof of the existence of a unique solution of the difference
equation and of the convergence of many of the numerical procedures for
solving the difference equation is greatly simplified if the $\alpha_i$ appearing in
(11.15) are all positive. Evidently all the $\alpha_i$ will be positive provided $h$ is
chosen so small that

$$h < \min\left\{\frac{2A}{|D|}, \frac{2C}{|E|}\right\}, \tag{11.21}$$

the minimum being taken over all points of $R + S$. Since $A > 0$, $C > 0$,
and since $A$, $C$, $D$, $E$ are continuous and hence bounded in $R + S$, it is
clear that a positive minimum exists.

In the important special case in which (11.3) is essentially self-adjoint,
that is, the case in which

$$\frac{\partial}{\partial y}\left(\frac{D - A_x}{A}\right) = \frac{\partial}{\partial x}\left(\frac{E - C_y}{C}\right), \tag{11.22}$$

one can obtain a difference equation for which local accuracy is as good
as that for (11.14), both for regular and for irregular points, and for which
the coefficients $\alpha_i$ are positive for all $h > 0$. Moreover, if there are no
irregular points in $R_h$, the difference equation is symmetric, in the sense
that the coefficient of $U(x', y')$ in the equation for $(x, y)$ is the same as the
coefficient of $U(x, y)$ in the equation for $(x', y')$. In such a case the matrix
of the linear system corresponding to the difference equation would be
symmetric.

We recall that an equation of the form (11.3) is self-adjoint if it can be
written in the form

$$(A U_x)_x + (C U_y)_y + F U = G. \tag{11.23}$$

In order for this to be possible, it is clearly necessary and sufficient that we
have

$$D = A_x, \qquad E = C_y. \tag{11.24}$$

Now the condition (11.22) guarantees the existence of a function $\mu(x, y)$
such that, if both sides of (11.3) are multiplied by $\mu(x, y)$, then the
resulting equation is self-adjoint. In our present discussion we assume that
the process of finding $\mu(x, y)$ has been carried out and that the differential
equation has been reduced to the form (11.23).

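In deriving the difference equation for the neighborhood configuration
shown in Fig. 11.2, we use difference representations of the following
form:

$$(A U_x)_x \approx \frac{2}{(s_1 + s_3)h}\left[A(x + \tfrac12 s_1 h, y)\,\frac{U_1 - U_0}{s_1 h} - A(x - \tfrac12 s_3 h, y)\,\frac{U_0 - U_3}{s_3 h}\right] \tag{11.25}$$

and derive a difference equation

$$\alpha_1 U(x + s_1 h, y) + \alpha_3 U(x - s_3 h, y) + \alpha_2 U(x, y + s_2 h) + \alpha_4 U(x, y - s_4 h) - \alpha_0 U(x, y) = t(x, y),$$

where

$$\alpha_1 = \frac{A(x + \tfrac12 s_1 h, y)}{s_1(s_1 + s_3)}, \qquad \alpha_2 = \frac{C(x, y + \tfrac12 s_2 h)}{s_2(s_2 + s_4)},$$

$$\alpha_3 = \frac{A(x - \tfrac12 s_3 h, y)}{s_3(s_1 + s_3)}, \qquad \alpha_4 = \frac{C(x, y - \tfrac12 s_4 h)}{s_4(s_2 + s_4)}, \tag{11.26}$$

$$\alpha_0 = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 - \tfrac12 h^2 F(x, y), \qquad t = \tfrac12 h^2 G(x, y).$$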
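
The symmetry property claimed for regular points can be observed directly from (11.26): the coefficient linking $(x, y)$ to $(x + h, y)$ and the coefficient linking $(x + h, y)$ back to $(x, y)$ both evaluate $A$ at the midpoint. The sketch below assumes the reading of (11.26) given above; the function name and the sample coefficient functions are hypothetical.

```python
def self_adjoint_coefficients(A, C, F, G, x, y, h, s=(1.0, 1.0, 1.0, 1.0)):
    """Coefficients (11.26) for (A U_x)_x + (C U_y)_y + F U = G at the point (x, y).

    A, C, F, G are callables of (x, y); s = (s1, s2, s3, s4) are the arm
    fractions (all equal to 1 at a regular point).
    """
    s1, s2, s3, s4 = s
    a1 = A(x + 0.5 * s1 * h, y) / (s1 * (s1 + s3))
    a2 = C(x, y + 0.5 * s2 * h) / (s2 * (s2 + s4))
    a3 = A(x - 0.5 * s3 * h, y) / (s3 * (s1 + s3))
    a4 = C(x, y - 0.5 * s4 * h) / (s4 * (s2 + s4))
    a0 = a1 + a2 + a3 + a4 - 0.5 * h ** 2 * F(x, y)
    t = 0.5 * h ** 2 * G(x, y)
    return (a0, a1, a2, a3, a4), t


# Sample smooth coefficients with A > 0, C > 0, F <= 0 (chosen arbitrarily).
A = lambda x, y: 1.0 + x * x + y
C = lambda x, y: 2.0 + x * y
F = lambda x, y: -1.0
G = lambda x, y: 0.0

h = 0.25
here, _ = self_adjoint_coefficients(A, C, F, G, 0.5, 0.5, h)
right, _ = self_adjoint_coefficients(A, C, F, G, 0.5 + h, 0.5, h)
above, _ = self_adjoint_coefficients(A, C, F, G, 0.5, 0.5 + h, h)

if __name__ == "__main__":
    print("alpha_1 at (x, y):     ", here[1])
    print("alpha_3 at (x + h, y): ", right[3])
```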


We remark that, if there are irregular points in $R_h$, then the matrix
of the linear system which one obtains will not in general be symmetric.
This is probably not a serious disadvantage as far as the convergence rate
of the iterative process is concerned. However, one could obtain a
symmetric difference equation by using variational methods. We note
that the problem of solving (11.23) in $R$ with prescribed boundary values
on $S$ is equivalent to that of minimizing the integral

$$\iint_R \left(A U_x^2 + C U_y^2 - F U^2 + 2GU\right) dx\,dy$$


subject to the same boundary conditions. Now, if one represents the 
above integral by an expression involving appropriate sums of difference 
quotients, one obtains a quadratic form. The conditions for minimizing 
this quadratic form lead to Eqs. (11.15) and (11.26) for regular points 
and to somewhat more complicated equations for irregular points. The 
matrix of the system of equations is symmetric. 

The methods just described are by no means the only methods for
deriving the difference equations. One can, for instance, seek to 
represent the differential operator by a linear combination of functional 
values at neighboring points. By the use of Taylor series, one deter- 
mines the coefficients occurring in the linear combination, in order to 
achieve the greatest local accuracy. Greater accuracy can sometimes 
be obtained by using the fact that the derivatives of the solution of the 
differential equation are related not only by the differential equation but 
by all differential equations obtained by differentiating that equation. 
Another method, based on the use of integration, was used by Varga [43]. 

Here and subsequently we assume that our difference equation can be 
written in the form 


$$\alpha_1(x, y)\,u(x + s_1 h, y) + \alpha_2(x, y)\,u(x, y + s_2 h) + \alpha_3(x, y)\,u(x - s_3 h, y) + \alpha_4(x, y)\,u(x, y - s_4 h) - \alpha_0\,u(x, y) = t(x, y), \tag{11.27}$$

where the $\alpha_i(x, y)$ are positive and where

$$\alpha_0 \ge \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4. \tag{11.28}$$

We further assume that the region $R_h$ is connected, that is, that any 2
points of $R_h$ can be joined by a broken line consisting of line segments
connecting adjacent points of $R_h$.

We now show that (11.27) has a unique solution. To do this, it is
clearly necessary and sufficient to show that the determinant of the
related linear system does not vanish. This will be the case provided
that the homogeneous system, obtained by letting the boundary values
and $t(x, y)$ vanish, has only the trivial solution which vanishes everywhere
in $R_h$. Suppose the homogeneous system has a nontrivial solution.
Then for some point of $R_h$ we have $u \ne 0$. We can assume $u > 0$ at
this point, since otherwise we could consider $(-u)$, which would also be a
solution of the homogeneous system. Let $M$ denote the largest positive
value of $u$ in $R_h$, and let $u(x_0, y_0) = M$. From (11.27) and (11.28) it
follows that $u$ must assume the value $M$ at each of the 4 adjacent points.
Continuing this process and using the connectedness of $R_h$, we conclude
that $u(x, y) = M$ for all $(x, y)$ in $R_h$ and also in $S_h$. This contradicts the
assumption that $u$ vanishes on $S_h$ and proves that $u$ vanishes identically
in $R_h + S_h$. It then follows that (11.27) has a unique solution.


11.4 Accuracy 


It can be shown that, under rather general conditions [9], as $h$ tends
to zero, the solutions of the difference equation approach the solution of
the differential equation. However, very little is known about the
accuracy of the solution of the difference equation for a given value of $h$.
Geršgorin [21] has obtained bounds for the maximum error in terms of
the local error. In the case of Laplace’s equation

$$U_{xx} + U_{yy} = 0 \tag{11.29}$$

for a region containing only regular interior points, we have, for all $(x, y)$
in $R_h$,

$$|u_h(x, y) - U(x, y)| \le \tfrac{1}{24} M_4 r^2 h^2,$$

where $r$ is the radius of any circle containing the given region and where

$$M_4 = \max\left\{\max_{(x,y) \in R+S} |U_{xxxx}|,\; \max_{(x,y) \in R+S} |U_{yyyy}|\right\}.$$

Here $u_h$ denotes the solution of the difference equation (11.14)
corresponding to the mesh size $h$. Since the fourth partial derivatives of the
solution are seldom known unless the solution itself is known, this
estimate cannot usually be applied. In some cases these derivatives
may not even be bounded. On the other hand, one can sometimes
estimate $U_{xxyy}$ by means of the difference quotient

$$U_{xxyy} \approx h^{-4}\{u_h(x + h, y + h) - 2u_h(x, y + h) + u_h(x - h, y + h) - 2[u_h(x + h, y) - 2u_h(x, y) + u_h(x - h, y)] + u_h(x + h, y - h) - 2u_h(x, y - h) + u_h(x - h, y - h)\}$$

and then estimate $U_{xxxx}$ and $U_{yyyy}$ by the relations

$$U_{xxxx} = -U_{xxyy}, \qquad U_{yyyy} = -U_{xxyy},$$

which can be derived from (11.29).
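
The nine-point quotient for $U_{xxyy}$ is a second difference in $y$ of second differences in $x$, which suggests a compact way to code it. In the sketch below the function name and the test function $x^2 y^2$ (for which $U_{xxyy} = 4$ and the quotient is exact) are assumptions chosen for illustration; in practice `u` would be the computed net function $u_h$.

```python
def mixed_fourth_quotient(u, x, y, h):
    """Nine-point difference quotient approximating U_xxyy at (x, y)."""

    def second_x(yy):
        # Undivided second difference in x along the line y = yy.
        return u(x + h, yy) - 2.0 * u(x, yy) + u(x - h, yy)

    return (second_x(y + h) - 2.0 * second_x(y) + second_x(y - h)) / h ** 4


def sample(x, y):
    """Test function with U_xxyy = 4 everywhere."""
    return x * x * y * y


estimate = mixed_fourth_quotient(sample, 0.3, 0.7, 0.1)

if __name__ == "__main__":
    print(estimate)
```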

Another method of estimating the accuracy of the finite-difference solu- 
tion is to solve the difference equation with two different mesh sizes. If 
the difference between these solutions is small, then one may feel justified 
in assuming that the error is small. If one assumes more about the
behavior of $u_h - U$, namely, that $u_h - U$ is proportional to $h^2$, then one
can, by computing $u_h$ and $u_{h/2}$, “extrapolate to zero grid size,” using the
formula

$$U(x, y) \approx \tfrac43\,u_{h/2}(x, y) - \tfrac13\,u_h(x, y).$$


Although this process is somewhat dangerous, it sometimes gives a 
remarkable improvement in the accuracy. 
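
The extrapolation formula can be tried on a small model problem. The sketch below is not taken from the text: it assumes, purely for illustration, the one-dimensional analogue $-u'' = \pi^2 \sin \pi x$, $u(0) = u(1) = 0$, whose solution $\sin \pi x$ is known, and solves the three-point difference equations by the standard tridiagonal elimination. The names `solve_model_problem` and `extrapolate` are hypothetical.

```python
import math


def solve_model_problem(n):
    """Difference solution of -u'' = pi^2 sin(pi x), u(0) = u(1) = 0, with h = 1/n.

    Returns the list of values at x = 0, h, 2h, ..., 1.
    """
    h = 1.0 / n
    m = n - 1  # number of interior mesh points
    rhs = [h * h * math.pi ** 2 * math.sin(math.pi * (i + 1) * h) for i in range(m)]

    # Tridiagonal elimination for the stencil (-1, 2, -1).
    c = [0.0] * m
    d = [0.0] * m
    c[0] = -1.0 / 2.0
    d[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c[i - 1]
        c[i] = -1.0 / denom
        d[i] = (rhs[i] + d[i - 1]) / denom

    u = [0.0] * (n + 1)
    u[m] = d[m - 1]
    for i in range(m - 2, -1, -1):
        u[i + 1] = d[i] - c[i] * u[i + 2]
    return u


def extrapolate(u_h, u_half):
    """Extrapolation to zero grid size: (4/3) u_{h/2} - (1/3) u_h."""
    return (4.0 * u_half - u_h) / 3.0


# Values at the midpoint x = 1/2, where the exact solution equals 1.
u_h = solve_model_problem(4)[2]
u_half = solve_model_problem(8)[4]
u_extrapolated = extrapolate(u_h, u_half)

if __name__ == "__main__":
    print(f"h = 1/4: error {abs(u_h - 1.0):.2e}")
    print(f"h = 1/8: error {abs(u_half - 1.0):.2e}")
    print(f"extrapolated: error {abs(u_extrapolated - 1.0):.2e}")
```

The improvement here depends on the error being very nearly proportional to $h^2$, which holds for this smooth model problem but, as the text goes on to report, may fail when the boundary or the boundary values are not smooth.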

Some empirical studies on the accuracy of finite-difference methods
for solving Laplace’s equation are reported in [57]. For problems where
the boundary had no corners with interior angle greater than 180° and
where the boundary values were continuous, the error was very small,
and the extrapolation process yielded good results. However, if the
boundary did have a corner with an interior angle of 270° or if the function
defining the boundary values was not continuous, then the error was
much larger, and the extrapolation process was not satisfactory.

None of the papers which have been written on error estimation (e.g.,
[51, 52, 47, 48, 49, 35, 65]) can be said to come close to satisfying the
needs of the practical computer user. The field is very much open for
further research.


11.5 Point Iterative Methods 


As pointed out in the preceding section, little is known concerning the
accuracy of numerical results obtained by solving difference equations
corresponding to elliptic differential equations. Because of this, it
would appear desirable to use an extremely fine mesh size in order to
increase the likelihood of obtaining a desired accuracy. However, since
the number of interior net points increases as $h^{-2}$, the number of linear
equations to be solved also increases as $h^{-2}$. The problem of solving the
difference equation presents a serious practical difficulty, despite the
fact that the existence of a unique solution is known.

For linear systems of the size encountered in practice, the use of
direct methods such as Cramer’s rule, involving the use of determinants,
and the Gauss elimination method is not practical, except possibly for
certain very special cases involving rectangles. One is led, instead, to
consider iterative methods, where one assumes an arbitrary initial
approximation to the solution and then improves this approximation
according to a prescribed procedure.*

In this section we describe some methods wherein the approximate
solution is improved point by point. In the next two sections we
consider methods involving simultaneous changes at groups of points.
We present here, without proof, descriptions of the methods as well as
rules and suggestions for their efficient use. Later, we present some of
the underlying theory.

If we solve (11.27) for $u(x, y)$, we get

$$u(x, y) = \beta_1 u(x + h_1, y) + \beta_3 u(x - h_3, y) + \beta_2 u(x, y + h_2) + \beta_4 u(x, y - h_4) + \tau(x, y), \tag{11.30}$$

where

$$\tau(x, y) = -\frac{t(x, y)}{\alpha_0}$$

and where, for $i = 1, 2, 3, 4$,

$$\beta_i = \frac{\alpha_i}{\alpha_0}, \tag{11.31}$$

$$h_i = h s_i. \tag{11.32}$$

For our present discussion it is sufficient to know that the $\beta_i$ are all
positive and that

$$\sum_{i=1}^{4} \beta_i \le 1. \tag{11.33}$$

The usual procedure is to compute all the $\beta_i$ initially and to store
them in the relatively slow-speed auxiliary memory of the computer,
such as is provided by magnetic drums or tapes. This avoids the
necessity of recomputing the $\beta_i$ for each iteration, which would be a wasteful
process, and at the same time saves the high-speed memory for other use.
Large blocks of the $\beta_i$ can be read into the high-speed memory
conveniently when needed. Of course, if the differential equation is sufficiently
simple, it may happen that many or all of the $\beta_i$ are identical. In such
cases a different scheme should be used, since it would be inefficient to
store all the $\beta_i$.

* With relaxation methods, which are more suited for hand computation than for
machine use, one continually improves the approximate solution, varying the
procedure from time to time in accordance with the judgment of the human computer.

Perhaps the simplest of the point iterative methods is the Gauss-Seidel
method, also known as the method of successive displacements, where,
starting with arbitrary initial values $u^{(0)}(x, y)$ for all points of $R_h$, one
improves the values at points of $R_h$ in an arbitrary but fixed order, using
(11.30) and using improved values as soon as available. Thus, if we
consider the ordering where $(x, y)$ follows $(x', y')$, provided that either
$y > y'$ or $y = y'$ and $x > x'$, then the improvement formula is

$$u^{(n+1)}(x, y) = \beta_1 u^{(n)}(x + h_1, y) + \beta_3 u^{(n+1)}(x - h_3, y) + \beta_2 u^{(n)}(x, y + h_2) + \beta_4 u^{(n+1)}(x, y - h_4) + \tau(x, y). \tag{11.34}$$


A complete iteration consists of improving the approximate values at all
points of $R_h$. Having traversed the points of $R_h$, one starts over again at
the "first" point and repeats the process until $d_n < \epsilon$, where $\epsilon$ is a
prescribed tolerance and where

$$d_n = \max_{(x, y) \in R_h} |u^{(n)}(x, y) - u^{(n-1)}(x, y)|. \tag{11.35}$$

We remark that in a similar method, known as the Jacobi method or the
method of simultaneous displacements, one does not use improved values until
after a complete iteration. The improvement formula for this method is

$$u^{(n+1)}(x, y) = \beta_1 u^{(n)}(x + h_1, y) + \beta_3 u^{(n)}(x - h_3, y) + \beta_2 u^{(n)}(x, y + h_2) + \beta_4 u^{(n)}(x, y - h_4) + \tau(x, y). \tag{11.36}$$


In Sec. 11.8 we show that the Gauss-Seidel method converges exactly 
twice as fast as the Jacobi method, where the rate of convergence can be 
defined in a mathematically precise way. 
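
The relation between the two methods can be observed numerically. The following is a minimal sketch, not part of the original text: the function name `iterate`, the tolerance, and the choice of Laplace's equation on the unit square with $u = 1000$ on the side $y = 1$ are assumptions made for illustration. It counts the complete iterations needed by each method.

```python
def iterate(M, method, tol=1e-10, max_sweeps=100_000):
    """Solve the 5-point Laplace equations on the unit square with h = 1/M.

    Boundary values (assumed for this sketch): u = 1000 on y = 1, u = 0 elsewhere.
    method is "jacobi" or "gauss-seidel". Returns (u, sweeps), where
    u[j][i] approximates u(i*h, j*h).
    """
    u = [[0.0] * (M + 1) for _ in range(M + 1)]
    u[M] = [1000.0] * (M + 1)
    for sweep in range(1, max_sweeps + 1):
        # Jacobi reads only values of the previous iteration;
        # Gauss-Seidel reads the grid that is being updated.
        old = [row[:] for row in u] if method == "jacobi" else u
        d = 0.0
        for j in range(1, M):        # rows in order of increasing y
            for i in range(1, M):    # points in order of increasing x
                new = 0.25 * (old[j][i + 1] + old[j][i - 1]
                              + old[j + 1][i] + old[j - 1][i])
                d = max(d, abs(new - u[j][i]))
                u[j][i] = new
        if d < tol:
            return u, sweep
    raise RuntimeError("no convergence within max_sweeps")


if __name__ == "__main__":
    _, n_jacobi = iterate(10, "jacobi")
    _, n_gs = iterate(10, "gauss-seidel")
    print(n_jacobi, n_gs, n_jacobi / n_gs)
```

For a moderately fine mesh the ratio of the two iteration counts should come out near 2, in line with the statement above.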

As an example of the Gauss-Seidel method, let us consider a problem
involving Laplace's equation

$$u_{xx} + u_{yy} = 0 \tag{11.37}$$

for the unit square $0 \le x \le 1$, $0 \le y \le 1$ with $h = M^{-1}$, where $M$ is an
integer. For boundary values let us assume that $u = 1000$ on the side
of the square where $y = 1$ and that $u = 0$ elsewhere. Evidently we have
$\beta_1 = \beta_2 = \beta_3 = \beta_4 = \tfrac{1}{4}$, $h_1 = h_2 = h_3 = h_4 = h$, and $\tau = 0$. Table
11.1 illustrates the computational procedure for the case $M = 3$.
The initial approximation to the solution was assumed to be zero for all
4 interior net points. The ordering of the points was $(\tfrac{1}{3}, \tfrac{1}{3})$, $(\tfrac{2}{3}, \tfrac{1}{3})$,
$(\tfrac{1}{3}, \tfrac{2}{3})$, $(\tfrac{2}{3}, \tfrac{2}{3})$.

TABLE 11.1 AN EXAMPLE OF THE USE OF THE GAUSS-SEIDEL METHOD

                   x = 1/3              x = 2/3
           n    u^(n)   d^(n)       u^(n)   d^(n)
y = 2/3    0       0                   0
           1     250     250         312     312
           2     344      94         360      48
           3     368      24         372      12
           4     374       6         375       3
           5     375       1         375       0
           6     375       0         375       0
           7     375       0         375       0
y = 1/3    0       0                   0
           1       0       0           0       0
           2      62      62          94      94
           3     110      48         118      24
           4     122      12         124       6
           5     124       2         125       1
           6     125       1         125       0
           7     125       0         125       0

After seven complete iterations the values converged. For convenience,
the $d^{(n)} = |u^{(n)} - u^{(n-1)}|$ are given as well as the $u^{(n)}$. The ratios
$d^{(n)}/d^{(n-1)}$ give an indication of how rapidly the errors are decreasing.
In this case it can be seen that the ratios are approximately $\tfrac{1}{4}$; hence
the convergence is rapid. However, the general formula for $\lambda$, the
limiting ratio, is

$$\lambda = \cos^2 \frac{\pi}{M} \sim 1 - \left(\frac{\pi}{M}\right)^2 \tag{11.38}$$

(see, e.g., [55]). We refer to $\lambda$ as the spectral radius* of the Gauss-Seidel
method. Thus, for instance, if $M = 20$, we have $\lambda = .975$, and
the convergence is very slow.
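
The limiting ratio can be checked by direct computation. The sketch below is an illustration under stated assumptions rather than part of the original text: the function name `gauss_seidel_differences` and the numbers of sweeps are chosen for the example, and full floating-point precision is used instead of the 3 figures of Table 11.1. It compares the observed ratio of successive $d^{(n)}$ with $\cos^2(\pi/M)$.

```python
import math


def gauss_seidel_differences(M, sweeps):
    """Return [d_1, ..., d_sweeps] for Laplace's equation on the unit square,
    h = 1/M, with u = 1000 on y = 1 and u = 0 on the other three sides."""
    u = [[0.0] * (M + 1) for _ in range(M + 1)]
    u[M] = [1000.0] * (M + 1)
    d = []
    for _ in range(sweeps):
        biggest = 0.0
        for j in range(1, M):
            for i in range(1, M):
                new = 0.25 * (u[j][i + 1] + u[j][i - 1]
                              + u[j + 1][i] + u[j - 1][i])
                biggest = max(biggest, abs(new - u[j][i]))
                u[j][i] = new  # improved value is used at once
        d.append(biggest)
    return d


if __name__ == "__main__":
    for M, sweeps in ((3, 12), (20, 400)):
        d = gauss_seidel_differences(M, sweeps)
        print(M, d[-1] / d[-2], math.cos(math.pi / M) ** 2)
```

The number of sweeps is kept small enough that the differences remain well above rounding error, so that the last ratio is meaningful.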
By a simple modification of (11.34) we can make a substantial improvement
in the rate of convergence. We use the following formula:

$$u^{(n+1)}(x, y) = \omega[\beta_1 u^{(n)}(x + h_1, y) + \beta_3 u^{(n+1)}(x - h_3, y) + \beta_2 u^{(n)}(x, y + h_2) + \beta_4 u^{(n+1)}(x, y - h_4) + \tau(x, y)] - (\omega - 1) u^{(n)}(x, y). \tag{11.39}$$

Here $\omega$ is a parameter known as a relaxation factor, the choice of which
determines the rate of convergence of the method. Evidently, when
$\omega = 1$, the method reduces to the Gauss-Seidel method.
The method defined by (11.39) is known as the successive-overrelaxation
method (see [55]) and, when applied to Laplace's equation, as the
extrapolated Liebmann method (see [16]). Its use will be illustrated by
applying it to the example involving Laplace's equation which was
considered previously in connection with the Gauss-Seidel method.
Table 11.2 shows the results obtained using a relaxation factor of 1.1.

* Actually $\lambda$ is the spectral radius of the matrix of the linear transformation
associated with the Gauss-Seidel method. In general, as in [55], where the term
"spectral norm" is used, we define the spectral radius of a matrix as the maximum of
the moduli of its eigenvalues.


TABLE 11.2 AN EXAMPLE OF THE USE OF THE SUCCESSIVE-OVERRELAXATION METHOD

                   x = 1/3              x = 2/3
           n    u^(n)   d^(n)       u^(n)   d^(n)
y = 2/3    0       0                   0
           1     275     275         351     351
           2     365      90         373      22
           3     376      11         376       3
           4     376       0         376       0
y = 1/3    0       0                   0
           1       0       0           0       0
           2      76      76         118     118
           3     126      50         126       8
           4     126       0         126       0


We note that convergence was achieved after four iterations. The
converged values are slightly in error because of the fact that only 3
significant figures were retained. The erratic behavior of the ratios
$d^{(n)}/d^{(n-1)}$ should be noted.

Actually, the best value of $\omega$ would have been 1.072. The general
formula for the best value is

$$\omega_b = 1 + \frac{\lambda}{(1 + \sqrt{1 - \lambda})^2}, \tag{11.40}$$

where $\lambda$ is the limiting value of $d_n/d_{n-1}$ for the Gauss-Seidel method.
The value of 1.072 for our example is readily obtained by substituting
$\lambda = \tfrac{1}{4}$ into (11.40). It is possible to compute $\lambda$ exactly for rectangles
as well as for squares [see (11.38) for the square]. It can be shown that
the convergence of the successive-overrelaxation method with the best
value of $\omega$ is approximately as rapid as it would be if $d_{n+1}/d_n$ tended to
$\lambda^*$, where

$$\lambda^* = \omega_b - 1 \sim 1 - \frac{2\pi}{M}.$$

Here $\lambda^*$ is the spectral radius of the successive-overrelaxation method.
In the case $M = 20$, we get $\omega_b = 1.73$ and hence $\lambda^* = .73$, compared
with $\lambda = .975$ for the Gauss-Seidel method. The rapidity of convergence
for the successive-overrelaxation method is thus seen to be very much
larger than for the Gauss-Seidel method.

Some rules and suggestions for choosing $\lambda$ are given in [57], where
numerical experiments involving the use of the method are also described.
If, as in the case of Laplace's equation, one can determine $\lambda$ exactly for a
general rectangle, one should do so for a rectangle which wholly contains
$R + S$. The $\omega$ determined in this way will be larger than the best value.
This is fortunate, since the theory indicates that it is much better to
overestimate than to underestimate $\omega$. In spite of this, it may be better to use
a somewhat smaller rectangle which has the same area as the given region.
Another method of choosing $\lambda$ is to perform a number of iterations using
$\omega = 1$ and to attempt to estimate the limiting value of the ratio $d_n/d_{n-1}$.
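
Formula (11.40) and its effect on the number of iterations can be illustrated as follows. This is a sketch under stated assumptions, not the author's program: the names `best_omega` and `sor_sweeps`, the tolerance, and the boundary values ($u = 1000$ on $y = 1$, zero elsewhere) are chosen for the example.

```python
import math


def best_omega(lam):
    """Formula (11.40): best relaxation factor, given the limiting ratio lam
    of the Gauss-Seidel method."""
    return 1.0 + lam / (1.0 + math.sqrt(1.0 - lam)) ** 2


def sor_sweeps(M, omega, tol=1e-6, max_sweeps=100_000):
    """Count complete iterations of (11.39) for Laplace's equation on the
    unit square with h = 1/M, stopping when d_n < tol."""
    u = [[0.0] * (M + 1) for _ in range(M + 1)]
    u[M] = [1000.0] * (M + 1)
    for sweep in range(1, max_sweeps + 1):
        d = 0.0
        for j in range(1, M):
            for i in range(1, M):
                gs = 0.25 * (u[j][i + 1] + u[j][i - 1]
                             + u[j + 1][i] + u[j - 1][i])
                new = omega * gs - (omega - 1.0) * u[j][i]
                d = max(d, abs(new - u[j][i]))
                u[j][i] = new
        if d < tol:
            return sweep
    raise RuntimeError("no convergence within max_sweeps")


if __name__ == "__main__":
    print(best_omega(0.25))                       # the case M = 3
    lam20 = math.cos(math.pi / 20) ** 2           # (11.38) with M = 20
    w20 = best_omega(lam20)
    print(w20, sor_sweeps(20, 1.0), sor_sweeps(20, w20))
```

With $\omega = 1$ the routine is the Gauss-Seidel method, so the two counts printed for $M = 20$ give a direct comparison of the methods.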

Two other point iterative methods should be mentioned. The first is 
due to Richardson and is described in [33]. The convergence of the 
method has been discussed in [56, 37, 28]. Although Richardson's 
method as a method of solving linear systems is of wider applicability 
than the successive-overrelaxation method, it is considerably less effective 
for linear systems arising from elliptic difference equations (see the dis- 
cussion in [55, 56, 61]). We discuss Richardson’s method again in Sec. 
11.18. 

Another very interesting method, presented by Sheldon [36], is one 
which combines ideas used in Aitken’s “to-and-fro”’ method [1], the 
successive-overrelaxation method, and Richardson’s method. Pending 
the development of a rigorous analysis of the convergence rate of the 
method, at least in some special cases, it appears doubtful that the method 
will be used as much as either the successive-overrelaxation method or the 
alternating-direction implicit methods, which are discussed in Sec. 11.7. 


11.6 Line Iterative Methods 


With line iterative methods one improves the values of the approximate
solution simultaneously on an entire line of points. The iterative formula
for successive row iteration is

$$u^{(n+1)}(x, y) = \beta_1 u^{(n+1)}(x + h_1, y) + \beta_3 u^{(n+1)}(x - h_3, y) + \beta_2 u^{(n)}(x, y + h_2) + \beta_4 u^{(n+1)}(x, y - h_4) + \tau(x, y). \tag{11.41}$$

As indicated, the rows are improved in the order of increasing $y$. For
simultaneous row iteration the improved values are not used until the end of a
complete iteration. The improvement formula is

$$u^{(n+1)}(x, y) = \beta_1 u^{(n+1)}(x + h_1, y) + \beta_3 u^{(n+1)}(x - h_3, y) + \beta_2 u^{(n)}(x, y + h_2) + \beta_4 u^{(n)}(x, y - h_4) + \tau(x, y). \tag{11.42}$$

For each $y$ the determination of the improved values involves the
solution of a system of $M$ linear equations with $M$ unknowns, where $M$ is
the number of interior net points on the given row. Fortunately, because
the matrix of the linear system is tridiagonal, that is, has no nonzero
elements except on the main diagonal and on the diagonals adjacent to the
main diagonal, there is a very simple algorithm for solving the equations.
This algorithm, which is easily derived by using the Gaussian elimination
method, involves a number of arithmetic operations which is proportional
to $M$, as opposed to the approximately $\tfrac{1}{3}M^3$ operations required with a
general matrix. The algorithm appears to have been presented first by
Thomas [41]. It was first used in connection with parabolic partial
differential equations by Bruce, Peaceman, Rachford, and Rice [7].
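We describe the procedure for the linear system

$$\begin{aligned}
B_1 T_1 + C_1 T_2 &= D_1, \\
A_i T_{i-1} + B_i T_i + C_i T_{i+1} &= D_i, \quad i = 2, 3, \ldots, M - 1, \\
A_M T_{M-1} + B_M T_M &= D_M,
\end{aligned} \tag{11.43}$$

where the $A_i$, $B_i$, $C_i$, $D_i$ are given and where the $T_i$ are to be determined.
The solution can be determined by the following formulas:

$$\begin{aligned}
b_1 &= \frac{C_1}{B_1}, & b_i &= \frac{C_i}{B_i - A_i b_{i-1}} \quad (i = 2, 3, \ldots, M - 1), \\
q_1 &= \frac{D_1}{B_1}, & q_i &= \frac{D_i - A_i q_{i-1}}{B_i - A_i b_{i-1}} \quad (i = 2, 3, \ldots, M), \\
T_M &= q_M, & T_i &= q_i - b_i T_{i+1} \quad (i = M - 1, M - 2, \ldots, 1).
\end{aligned} \tag{11.44}$$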
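
The formulas (11.44) translate directly into a short routine. The following sketch is illustrative only: the function name `solve_tridiagonal` and the use of 0-based lists are assumptions of the example, and no pivoting is attempted, which is adequate for the diagonally dominant systems arising here but not for general tridiagonal matrices.

```python
def solve_tridiagonal(A, B, C, D):
    """Solve the system (11.43) by the formulas (11.44).

    A, B, C, D are lists of length M holding the sub-diagonal, diagonal,
    super-diagonal, and right-hand side (0-based). A[0] and C[M-1] are ignored.
    """
    M = len(B)
    b = [0.0] * M
    q = [0.0] * M
    # Forward sweep: compute the b_i and q_i.
    if M > 1:
        b[0] = C[0] / B[0]
    q[0] = D[0] / B[0]
    for i in range(1, M):
        denom = B[i] - A[i] * b[i - 1]
        if i < M - 1:
            b[i] = C[i] / denom
        q[i] = (D[i] - A[i] * q[i - 1]) / denom
    # Back substitution: T_M = q_M, then T_i = q_i - b_i * T_{i+1}.
    T = [0.0] * M
    T[M - 1] = q[M - 1]
    for i in range(M - 2, -1, -1):
        T[i] = q[i] - b[i] * T[i + 1]
    return T


if __name__ == "__main__":
    # One row of Laplace's equation: 4*T_i - T_{i-1} - T_{i+1} = D_i.
    print(solve_tridiagonal([0, -1, -1, -1], [4, 4, 4, 4],
                            [-1, -1, -1, 0], [2, 4, 6, 13]))
```

Each of the two sweeps visits every unknown once, which is the source of the operation count proportional to $M$.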


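Successive line overrelaxation can be defined in an obvious way as
follows:

$$u^{(n+1)}(x, y) = u^{(n)}(x, y) + \omega[\bar{u}^{(n+1)}(x, y) - u^{(n)}(x, y)], \tag{11.45}$$

where

$$\bar{u}^{(n+1)}(x, y) = \beta_1 \bar{u}^{(n+1)}(x + h_1, y) + \beta_3 \bar{u}^{(n+1)}(x - h_3, y) + \beta_2 u^{(n)}(x, y + h_2) + \beta_4 u^{(n+1)}(x, y - h_4) + \tau(x, y). \tag{11.45$'$}$$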


We shall see in Sec. 11.10 that the rates of convergence of the three 
methods of row iteration which we have just defined bear the same 
relation to each other as do the corresponding point iteration methods. 
Thus, for instance, the rate of convergence of the method of successive 
row iteration is exactly twice that of simultaneous row iteration. 
Furthermore, the optimum value of $\omega$ is given by

$$\omega_b = 1 + \frac{\lambda}{(1 + \sqrt{1 - \lambda})^2},$$

where $\lambda$ is the spectral radius of the method of successive row iteration.
Let us now compare the line and point iteration methods for the case of
Laplace's equation in the unit square with $h = M^{-1}$. It is not difficult
to show (see, e.g., [3]) that

$$\lambda = \left(\frac{\cos \pi h}{2 - \cos \pi h}\right)^2 \sim 1 - 2\pi^2 h^2,$$

compared with $1 - \pi^2 h^2$ for successive point relaxation. Thus successive
row iteration converges approximately twice as fast as successive
point iteration. Later, in Sec. 11.10, we show that the method of successive
row overrelaxation converges approximately $\sqrt{2}$ times as fast as
successive point overrelaxation. At first sight, this relatively small
improvement would not seem to be worth the extra effort involved in
using (11.44) to perform the row iteration. However, Cuthill and Varga
[64] showed that it is practical in some cases to compute and store once
and for all certain coefficients, such as the coefficients of the inverse of the
tridiagonal matrix, which can be used to carry out the line relaxation
process much more readily than would be possible with the algorithm
defined by (11.44). As a matter of fact, using the procedure of Cuthill
and Varga, the computational effort per point is the same for block
relaxation as for point relaxation.


11.7 Alternating-direction Implicit Methods 


Two methods which are somewhat similar to line relaxation have been 
developed recently. One, which was presented by Peaceman and Rach- 
ford [32], is related to a method developed by Douglas [11] for solving the 
equation $u_t = u_{xx} + u_{yy}$. In Sec. 11.17 the relation between Douglas's
method and the method of Peaceman and Rachford is discussed in more
detail.

Douglas and Rachford [12] presented a method rather similar to that 
of Peaceman and Rachford. The former method can be generalized to 
three dimensions, whereas the latter apparently cannot. Nevertheless, 
since the Peaceman-Rachford method is superior to the Douglas-Rach- 
ford method in certain elementary cases and since the two methods and 
the analysis of their properties are quite similar, we discuss only the 
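Peaceman-Rachford method here. For convenience, we consider
Laplace's equation and a uniform mesh with $h_1 = h_2 = h_3 = h_4 = h$.
The basic formulas are

$$\begin{aligned}
u^{(n+1/2)}(x, y) = u^{(n)}(x, y) &+ r[u^{(n+1/2)}(x + h, y) + u^{(n+1/2)}(x - h, y) - 2u^{(n+1/2)}(x, y)] \\
&+ r[u^{(n)}(x, y + h) + u^{(n)}(x, y - h) - 2u^{(n)}(x, y)], \\
u^{(n+1)}(x, y) = u^{(n+1/2)}(x, y) &+ r[u^{(n+1/2)}(x + h, y) + u^{(n+1/2)}(x - h, y) - 2u^{(n+1/2)}(x, y)] \\
&+ r[u^{(n+1)}(x, y + h) + u^{(n+1)}(x, y - h) - 2u^{(n+1)}(x, y)].
\end{aligned} \tag{11.46}$$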
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We note that the first equation of (11.46), with r = %, defines a 
process which is very similar to but which is not the same as simultaneous 
row iteration. The Peaceman-Rachford procedure is essentially a kind
of row iteration followed by a column iteration. Although $r$ need not
be held fixed, it is important that the same value of $r$ be used for both
parts of an iteration. The choice of the values of $r$ is discussed in [32]
and [45].

We describe the use of the method for the case of a unit square with
mesh size $h$, following the results given in [45]. Starting with a positive
integer $t$, whose selection is discussed later, determine $z$ from the relation

$$z = \sigma^{1/(t-1)},$$

where

$$\sigma = \frac{a}{b}, \qquad b = 4\cos^2 \frac{\pi h}{2}, \qquad a = 4\sin^2 \frac{\pi h}{2}.$$

The $r_k$ are then determined by*

$$r_k = \frac{1}{b z^{k-1}}, \qquad k = 1, 2, \ldots, t.$$

According to [45], the factor of reduction $P_t$ of the error achieved after $t$
double sweeps can be estimated, for small $a$, by a formula given there.
Since the average factor of reduction per double sweep is $S_t = (P_t)^{1/t}$,
one should choose $t$ so as to minimize $S_t$. Because of the complicated
nature of $S_t$ as a function of $t$, it is probably best to determine the best
value of $t$ by computing $S_t$ for a number of different values of $t$.

As an example, let us consider the case $h = 20^{-1}$. Here

$$a = 4\sin^2 \frac{\pi}{40} = .024623, \qquad b = 4\cos^2 \frac{\pi}{40} = 3.97538, \qquad \sigma = \frac{a}{b} = .0061940.$$

We compute $S_t$ for various values of $t$, obtaining

$$\begin{array}{ll}
S_7 = .3152 & S_9 = .3056 \\
S_8 = .3060 & S_{10} = .3064
\end{array}$$

The optimum value for $S_t$ is .3056, which is assumed for $t = 9$. The
corresponding values of $r_k$ are

$$\begin{array}{ll}
r_1 = .25155 & r_6 = 6.03450 \\
r_2 = .47492 & r_7 = 11.39320 \\
r_3 = .89666 & r_8 = 21.51045 \\
r_4 = 1.69291 & r_9 = 40.61191 \\
r_5 = 3.19623 &
\end{array}$$

* Wachspress reports, in a private communication, that it is preferable to use the
$r_k$ in the order $r_k = z^{k-1}/a$, $k = 1, 2, \ldots, t$.
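
The parameters of this example are easily recomputed. The sketch below is an illustration, not part of the original text: the function name `peaceman_rachford_parameters` is an assumption, and the routine evaluates only the formulas for $a$, $b$, $\sigma$, $z$, and $r_k$ given above (the reduction factor $S_t$ is not computed).

```python
import math


def peaceman_rachford_parameters(h, t):
    """Return (a, b, sigma, z, r) for the unit square with mesh size h,
    where r = [r_1, ..., r_t] and r_k = 1 / (b * z**(k - 1))."""
    a = 4.0 * math.sin(math.pi * h / 2.0) ** 2
    b = 4.0 * math.cos(math.pi * h / 2.0) ** 2
    sigma = a / b
    z = sigma ** (1.0 / (t - 1))
    r = [1.0 / (b * z ** (k - 1)) for k in range(1, t + 1)]
    return a, b, sigma, z, r


if __name__ == "__main__":
    a, b, sigma, z, r = peaceman_rachford_parameters(1.0 / 20.0, 9)
    print(f"a = {a:.6f}, b = {b:.5f}, sigma = {sigma:.7f}")
    for k, rk in enumerate(r, start=1):
        print(f"r_{k} = {rk:.5f}")
```

Since $z^{t-1} = \sigma = a/b$, the first and last parameters are $r_1 = 1/b$ and $r_t = 1/a$, and the intermediate values form a geometric progression between them.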


For the successive-overrelaxation method with the optimum value of
$\omega$, namely, 1.73, the average factor of reduction of the error is .73, as
compared with .3056 per double sweep for the Peaceman-Rachford
method. In general, it can be shown that the number of iterations
needed to achieve a specified accuracy varies as $|\log h|$ for the
Peaceman-Rachford method, which is much more favorable than the $h^{-1}$ required
for the successive-overrelaxation method. Thus, decreasing $h$ by a
factor of 2 in the above example would double the number of iterations
required for the successive-overrelaxation method, whereas there would
be an increase by only a factor of $|\log \tfrac{1}{40}|/|\log \tfrac{1}{20}| = 1.23$ for the
Peaceman-Rachford method.

In spite of the apparent advantages of the Peaceman-Rachford 
method, there are several reasons why one might hesitate to use it for 
some problems in preference to successive overrelaxation. The latter 
method is undoubtedly simpler. Not only are the basic formulas for the 
Peaceman-Rachford method considerably more complicated, but there 
is also a problem of obtaining the necessary data first by columns and 
then by rows, particularly if this information is stored on tape. The 
Peaceman-Rachford method may well be better for sufficiently small $h$,
but it is not clear whether, for a given case, it will be better for the
particular value of $h$ being used. Moreover, although the theory
underlying the successive-overrelaxation method has been extended to 
include a wide class of partial differential equations, including Laplace’s, 
and to include nonrectangular regions, the theory for the Peaceman- 
Rachford method is limited to problems involving a very restricted class 
of partial differential equations and the rectangle. Generalizations of 
the Peaceman-Rachford method are discussed further in Sec. 11.12. 




11.8 Theoretical Discussion of the Successive-overrelaxation 
Method 


In this section we give a sketch of the underlying theory on which
the analysis of the successive-overrelaxation method is based. The
discussion here is not intended to be either as detailed or as rigorous as
that given in [55]. Instead, a simplified and more intuitive discussion is
presented. The simplified presentation is based, in part, on conversations
with Prof. B. Friedman and on [17].

It is convenient to consider the following linear system:

$$\sum_{j=1}^{N} a_{i,j} u_j + d_i = 0, \qquad i = 1, 2, \ldots, N, \tag{11.47}$$

where $N$ is the number of points of $R_h$ and where

$$a_{i,i} > 0, \qquad i = 1, 2, \ldots, N; \tag{11.48a}$$

$$a_{i,i} \ge \sum_{\substack{j=1 \\ j \ne i}}^{N} |a_{i,j}|, \quad \text{and for at least one } i \text{ the strict inequality holds.} \tag{11.48b}$$

The matrix $A$ is irreducible: given any two nonempty disjoint subsets
$S$ and $T$ of the set $W$ of the first $N$ positive integers such that $S + T = W$,
there exists $a_{i,j} \ne 0$ such that $i \in S$ and $j \in T$. (11.48c)

The matrix $A$ has property A: there exist two disjoint subsets $S$ and $T$
of $W$, the set of the first $N$ positive integers, such that $S + T = W$
and such that, if $a_{i,j} \ne 0$, then either $i = j$ or $i \in S$ and $j \in T$ or $i \in T$
and $j \in S$. (11.48d)

Condition (11.48c) was formulated by Geiringer [20] and by Frobenius
[18]. Condition (11.48d), which was formulated in [55], is equivalent
to stating that the matrix $A$, by a suitable permutation of its rows and the
corresponding columns, can be written in the form

$$A = \begin{pmatrix} D_1 & F \\ G & D_2 \end{pmatrix}, \tag{11.49}$$

where $D_1$, $D_2$ are square diagonal matrices and where the rectangular
matrices $G$ and $F$ are arbitrary. Actually, a matrix with this property
belongs to the class of $p$-cyclic matrices studied by Frobenius [18] and
Romanovsky [34] (see Sec. 11.11). It is easy to show that conditions
(11.48) hold for any linear system arising from the difference equation
(11.27) provided that the $\alpha_i$ are all positive and that (11.28) holds (see,
e.g., [55]). We remark that the terms of (11.27) involving boundary
values are absorbed into the $d_i$ of (11.47). The nonzero coefficients $a_{i,j}$
with $i \ne j$ correspond to the $\alpha_i$ $(i = 1, 2, 3, 4)$ of (11.27), and the
diagonal elements $a_{i,i}$ correspond to $\alpha_0$ in (11.27). That (11.48c) holds
follows from the fact that the region $R_h$ is assumed to be connected.
Condition (11.48d) follows from consideration of a "coloring" of the
points of $R_h$ with two colors in such a way that every pair of adjacent
points of $R_h$ has different colors. It is easy to see that such a coloring is
possible. In order to obtain the form (11.49) it is necessary only to label
the points in an ordering so that all the points of one color occur before
any point of the other color. For it is clear from (11.27) that the difference
equation for a point of one color only involves values of $u$ at points of
the opposite color.
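
The coloring argument can be made concrete for a small net. The sketch below is an illustration under stated assumptions, not part of the original text: the function names `five_point_matrix` and `has_form_11_49` are hypothetical, the region is taken to be a square with $h = 1/M$, and the matrix is that of the 5-point Laplace difference equation (diagonal 4, off-diagonal $-1$).

```python
def five_point_matrix(M, ordering="two-color"):
    """Return (A, r): the matrix of the 5-point Laplace equations for the
    (M-1) x (M-1) interior net points, and the number r of points of the
    first color. ordering is "two-color" or "natural" (row by row)."""
    points = [(i, j) for j in range(1, M) for i in range(1, M)]
    first = [p for p in points if (p[0] + p[1]) % 2 == 0]
    second = [p for p in points if (p[0] + p[1]) % 2 == 1]
    ordered = first + second if ordering == "two-color" else points
    index = {p: k for k, p in enumerate(ordered)}
    n = len(ordered)
    A = [[0] * n for _ in range(n)]
    for (i, j), k in index.items():
        A[k][k] = 4
        for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if nb in index:          # neighbors on the boundary go into d
                A[k][index[nb]] = -1
    return A, len(first)


def has_form_11_49(A, r):
    """True if the leading r x r block and the trailing block are diagonal."""
    n = len(A)
    return all(A[i][j] == 0
               for i in range(n) for j in range(n)
               if i != j and (i < r) == (j < r))


if __name__ == "__main__":
    A, r = five_point_matrix(4)
    print(r, has_form_11_49(A, r))
    A_nat, _ = five_point_matrix(4, ordering="natural")
    print(has_form_11_49(A_nat, r))
```

Two points receive different colors here when $i + j$ differs in parity, which is one admissible coloring for a rectangular net.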
We may write (11.47) in the form

$$Au + d = 0, \tag{11.47$'$}$$

where $A$ is an $N \times N$ matrix and where $u$ and $d$ are column matrices.
It is convenient to consider the matrix $B$ defined by

$$B = -D^{-1}C, \tag{11.50}$$

where

$$A = D + C. \tag{11.51}$$

Here $D$ is a diagonal matrix, and the diagonal elements of $C$ vanish.
Evidently the diagonal elements of $B$ vanish, and we may write

$$B = L + U, \tag{11.52}$$

where $L$ and $U$ have no nonzero elements above and below the main
diagonal, respectively. We may write

$$u = Bu + c, \tag{11.47$''$}$$

where $c = -D^{-1}d$.
We remark that $B$ is the matrix corresponding to the Jacobi method of
iteration, discussed in Sec. 11.5.

It is easy to show that, if $A$ is symmetric, then $B$, though not necessarily
symmetric, is similar to a symmetric matrix. Consider the matrix
$D^{1/2} B D^{-1/2}$, which is similar to $B$. We have, by (11.50),

$$D^{1/2} B D^{-1/2} = -D^{-1/2} C D^{-1/2}, \tag{11.53}$$

which is symmetric, since $C$ is symmetric and $D^{-1/2}$ is diagonal.

We next show that, if $\mu$ is an eigenvalue of $B$, then so is $-\mu$. Since
the eigenvalues of $B$ do not depend on the ordering of the rows and
corresponding columns, we can assume that $A$ is in the form (11.49),
with $D_1$ and $D_2$ of order $r$ and $s$, respectively. Evidently $B$ has the form

$$B = \begin{pmatrix} O_1 & H \\ K & O_2 \end{pmatrix}, \tag{11.54}$$

where $O_1$ and $O_2$ are square null matrices of order $r$ and $s$, respectively.
If $\mu$ is an eigenvalue of $B$, we have

$$\det(B - \mu I) = \begin{vmatrix} -\mu I_1 & H \\ K & -\mu I_2 \end{vmatrix} = 0,$$

where $I_1$ and $I_2$ are square identity matrices of order $r$ and $s$, respectively.
If we multiply the first $r$ rows and the last $s$ columns of the determinant
by $-1$, the equality is preserved, and we obtain

$$\begin{vmatrix} \mu I_1 & H \\ K & \mu I_2 \end{vmatrix} = \det(B + \mu I) = 0.$$

Hence $-\mu$ is an eigenvalue of $B$.

Condition (11.48d) shows us that, by starting with $A$ and by rearranging,
if necessary, the rows and the corresponding columns of $A$ in a
certain ordering, we may obtain the form (11.49). More generally,
we say that any ordering of the rows and columns of $A$ is consistent if,
starting from the ordering, we can permute the rows and corresponding
columns in such a way that if $a_{i,j} \ne 0$, the ordering relation between
the $i$th and $j$th rows is unchanged and so that the matrix has the form

$$A = \begin{pmatrix}
D_1 & F_1 & O & \cdots & O & O \\
G_1 & D_2 & F_2 & \cdots & O & O \\
O & G_2 & D_3 & \cdots & O & O \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
O & O & O & \cdots & D_{p-1} & F_{p-1} \\
O & O & O & \cdots & G_{p-1} & D_p
\end{pmatrix}, \tag{11.55}$$

where $D_i$ is a square $r_i \times r_i$ matrix and where the $F_i$ and $G_i$ are $r_i \times r_{i+1}$
and $r_{i+1} \times r_i$ rectangular matrices, respectively. That such orderings
exist follows from (11.49). The matrix $B$ likewise has the form

$$B = \begin{pmatrix}
O_1 & H_1 & O & \cdots & O & O \\
K_1 & O_2 & H_2 & \cdots & O & O \\
O & K_2 & O_3 & \cdots & O & O \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
O & O & O & \cdots & K_{p-1} & O_p
\end{pmatrix}.$$
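The method of successive overrelaxation is defined by the formula,
given in [55],

$$u_i^{(n+1)} = \omega\left[\sum_{j=1}^{i-1} b_{i,j} u_j^{(n+1)} + \sum_{j=i+1}^{N} b_{i,j} u_j^{(n)} + c_i\right] - (\omega - 1) u_i^{(n)},$$

where the $b_{i,j}$ and the $c_i$ are the elements of $B$ and $c$, or, in matrix notation,

$$u^{(n+1)} = \omega[L u^{(n+1)} + U u^{(n)} + c] - (\omega - 1) u^{(n)}.$$

If we let

$$L_\omega = (I - \omega L)^{-1}[\omega U - (\omega - 1) I],$$

we have

$$u^{(n+1)} = L_\omega u^{(n)} + (I - \omega L)^{-1} \omega c.$$

We remark that the nonsingularity of $(I - \omega L)$ follows from the fact
that its determinant equals unity.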
To study the rate of convergence, we consider

$$e^{(n)} = u^{(n)} - u,$$

where $u$ is the true solution of (11.47$'$). Since $u = L_\omega u + (I - \omega L)^{-1} \omega c$
for any $\omega$, we have

$$e^{(n+1)} = L_\omega e^{(n)}. \tag{11.56}$$


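The rapidity of convergence depends on the eigenvalue of largest modulus
of the matrix $L_\omega$. We seek to relate the eigenvalues of $L_\omega$ to those of the
simpler matrix $B$. Suppose $\omega > 0$ and $\lambda$ is a nonzero eigenvalue of $L_\omega$.
Then

$$\begin{aligned}
0 &= \det(L_\omega - \lambda I) = \det\{(I - \omega L)^{-1}[\omega U - (\omega - 1) I] - \lambda I\}, \\
0 &= \det[\omega U - (\omega - 1) I - \lambda (I - \omega L)] \\
  &= \omega^N \det\left[U + \lambda L - \left(\frac{\lambda + \omega - 1}{\omega}\right) I\right].
\end{aligned}$$

If the ordering is consistent, then we may assume $A$ has the form
(11.55) since the permutations of the rows and columns necessary to
obtain this form clearly do not affect the $u^{(n)}$. Therefore the matrix
$\left[U + \lambda L - \left(\frac{\lambda + \omega - 1}{\omega}\right) I\right]$ may be written in the form

$$\begin{pmatrix}
-\alpha I_1 & H_1 & O & \cdots & O & O \\
\lambda K_1 & -\alpha I_2 & H_2 & \cdots & O & O \\
O & \lambda K_2 & -\alpha I_3 & \cdots & O & O \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
O & O & O & \cdots & \lambda K_{p-1} & -\alpha I_p
\end{pmatrix},$$

where for convenience we have set $\alpha = (\lambda + \omega - 1)/\omega$. Let*

$$\Gamma = \begin{pmatrix}
I_1 & O & O & \cdots & O \\
O & \lambda^{1/2} I_2 & O & \cdots & O \\
O & O & \lambda I_3 & \cdots & O \\
\vdots & \vdots & \vdots & & \vdots \\
O & O & O & \cdots & \lambda^{(p-1)/2} I_p
\end{pmatrix}.$$

It follows that

$$\Gamma^{-1}\left[U + \lambda L - \left(\frac{\lambda + \omega - 1}{\omega}\right) I\right]\Gamma = \lambda^{1/2}\left[U + L - \left(\frac{\lambda + \omega - 1}{\omega \lambda^{1/2}}\right) I\right],$$

and we have

$$\det\left(B - \frac{\lambda + \omega - 1}{\omega \lambda^{1/2}} I\right) = 0.$$

But since $\det(B - \mu I) = \prod_{i=1}^{N} (\mu_i - \mu)$,
where the $\mu_i$ are the eigenvalues of $B$, we have

$$\prod_{i=1}^{N} \left(\mu_i - \frac{\lambda + \omega - 1}{\omega \lambda^{1/2}}\right) = 0.$$

Thus, if $\lambda$ is an eigenvalue of $L_\omega$, then for some eigenvalue $\mu$ of $B$ we have

$$\frac{\lambda + \omega - 1}{\omega \lambda^{1/2}} = \mu.$$

It can also be shown that, if $\mu$ is an eigenvalue of $B$ and if $\lambda$ satisfies

$$(\lambda + \omega - 1) = \mu \omega \lambda^{1/2}, \tag{11.57}$$

then $\lambda$ is an eigenvalue of $L_\omega$ (see [55]). We remark that, since the
eigenvalues of $B$ are clearly independent of the ordering chosen, the
eigenvalues of $L_\omega$ are the same for all consistent orderings. Of course,
the eigenvectors will in general be different, and this may have some
effect on the time required to obtain convergence in practice.

* The matrix $\Gamma$ was used by Friedman [17]. One could also obtain explicit
expressions for the eigenvectors of $L_\omega$ as given in [53].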
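
Relation (11.57) can be verified by hand in the smallest nontrivial case. The sketch below is an illustration, not part of the original text: it assumes $N = 2$ with $B = \begin{pmatrix} 0 & u_{12} \\ l_{21} & 0 \end{pmatrix}$, so that $\mu^2 = l_{21} u_{12}$, and the function names are hypothetical. It forms $L_\omega$ explicitly and checks the squared form of (11.57), $(\lambda + \omega - 1)^2 = \omega^2 \mu^2 \lambda$.

```python
import cmath


def sor_eigenvalues_2x2(omega, l21, u12):
    """Eigenvalues of L_omega = (I - omega*L)^{-1} [omega*U - (omega - 1) I]
    for L = [[0, 0], [l21, 0]] and U = [[0, u12], [0, 0]]."""
    # (I - omega*L)^{-1} = [[1, 0], [omega*l21, 1]]; multiply it into
    # omega*U - (omega - 1) I = [[1 - omega, omega*u12], [0, 1 - omega]].
    m11 = 1.0 - omega
    m12 = omega * u12
    m21 = omega * l21 * m11
    m22 = omega * l21 * m12 + (1.0 - omega)
    trace = m11 + m22
    det = m11 * m22 - m12 * m21
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def residual_11_57(lam, omega, mu_squared):
    """Residual of the squared form of (11.57)."""
    return (lam + omega - 1.0) ** 2 - omega ** 2 * mu_squared * lam


if __name__ == "__main__":
    # A = [[2, -1], [-1, 2]] gives B = [[0, 1/2], [1/2, 0]], mu = +-1/2.
    for omega in (0.8, 1.0, 1.05, 1.5, 1.9):
        for lam in sor_eigenvalues_2x2(omega, 0.5, 0.5):
            print(omega, lam, abs(residual_11_57(lam, omega, 0.25)))
```

For values of $\omega$ near 2 the eigenvalues become complex, which is why `cmath` rather than `math` is used for the square root.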

Let us now assume that $A$ is symmetric. Hence the eigenvalues of $B$
are real and occur in pairs $-\mu$, $\mu$. Let $\bar{\mu}$ denote the largest eigenvalue
of $B$. By considering the mapping between the $\mu$ plane and the $\lambda$
plane defined by (11.57), it can be shown (see [55]) that $\omega_b$ defined by

$$\omega_b^2 \bar{\mu}^2 = 4(\omega_b - 1), \qquad 1 < \omega_b < 2, \tag{11.58}$$

or, equivalently,

$$\omega_b = 1 + \left(\frac{\bar{\mu}}{1 + \sqrt{1 - \bar{\mu}^2}}\right)^2 = \frac{2}{1 + \sqrt{1 - \bar{\mu}^2}}, \tag{11.58$'$}$$

is the optimum value of $\omega$ in the sense that the spectral radius of $L_{\omega_b}$ is
less than that for $L_\omega$ with $\omega \ne \omega_b$. Surprisingly, for $\omega$ in the range
$2 > \omega \ge \omega_b$, all the eigenvalues of $L_\omega$ have the same modulus, namely,
$\omega - 1$; hence

$$\lambda(L_\omega) = \omega - 1 \qquad (2 > \omega \ge \omega_b). \tag{11.59}$$

Consequently,

$$\lambda(L_{\omega_b}) = \left(\frac{\bar{\mu}}{1 + \sqrt{1 - \bar{\mu}^2}}\right)^2 = \frac{1 - \sqrt{1 - \bar{\mu}^2}}{1 + \sqrt{1 - \bar{\mu}^2}}. \tag{11.60}$$

Fig. 11.3


In order to compare the effectiveness of the methods which we have
considered, we introduce the concept of the rate of convergence, $R(A)$, of a
method whose matrix is $A$. This is defined by

$$R(A) = -\log \lambda(A), \tag{11.61}$$

where, here and elsewhere, we let $\lambda(A)$ denote the spectral radius of $A$.
The motivation here is that the number of iterations necessary to reduce
an initial error by a prescribed factor is, approximately, inversely
proportional to the rate of convergence.

If $\omega = 1$, the method of successive overrelaxation reduces to the
Gauss-Seidel method. By (11.57) we have

$$\lambda(L_1) = [\lambda(B)]^2; \tag{11.62}$$

hence

$$R(L_1) = 2R(B). \tag{11.63}$$

Thus the rate of convergence of the Gauss-Seidel method is just twice
that of the Jacobi method. By the use of (11.60) it can be shown (see
[55]) that, asymptotically as $\lambda(B)$ tends to unity,

$$R(L_{\omega_b}) \sim 2\sqrt{R(L_1)}. \tag{11.64}$$

It can be seen that the gain in the rate of convergence is very large in
cases where the Gauss-Seidel method is slow.

The statements made in Sec. 11.5 concerning the effect of overestimating
and underestimating $\omega$ can be shown by the use of (11.57).

As an example, let us consider the linear system arising from the usual
difference-equation analogue of the Dirichlet problem for a rectangle
with sides $a = Mh$, $b = Nh$, where $h$ is the mesh size and where $M$ and $N$
are integers. The difference equation is

$$4u_{i,j} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} = 0, \qquad i = 1, 2, \ldots, M - 1; \quad j = 1, 2, \ldots, N - 1. \tag{11.65}$$

Here $u_{i,j} = u(ih, jh)$. We assume that the $u_{i,0}$ and $u_{i,N}$ $(i = 1, 2, \ldots,
M - 1)$ are given and that the $u_{0,j}$ and $u_{M,j}$ $(j = 1, 2, \ldots, N - 1)$ are given.
The (point) Jacobi method is defined by

$$u_{i,j}^{(n+1)} = \tfrac{1}{4}\left(u_{i+1,j}^{(n)} + u_{i-1,j}^{(n)} + u_{i,j+1}^{(n)} + u_{i,j-1}^{(n)}\right). \tag{11.66}$$

Let $e_{i,j}^{(n)} = u_{i,j}^{(n)} - u_{i,j}$, where $u_{i,j}$ is the true solution of (11.65).
Evidently, $e^{(n+1)}$ and $e^{(n)}$ satisfy (11.66) but vanish on the boundary. If we
write (11.66) in the form

$$e^{(n+1)} = B e^{(n)},$$

where $e^{(n)}$ is a column matrix with elements $e_{i,j}^{(n)}$, we are led to seek the
eigenvalues $\mu$ of $B$. Evidently, for $p$, $q$ integers and for

$$v_{i,j} = \sin \frac{p\pi i h}{a} \sin \frac{q\pi j h}{b},$$

we have

$$Bv = \mu v,$$

where

$$\mu = \tfrac{1}{2}\left(\cos \frac{p\pi h}{a} + \cos \frac{q\pi h}{b}\right).$$

Clearly $\bar{\mu} = \lambda(B)$ is given by

$$\bar{\mu} = \tfrac{1}{2}\left(\cos \frac{\pi h}{a} + \cos \frac{\pi h}{b}\right).$$

The (point) successive-overrelaxation method is given by

$$u_{i,j}^{(n+1)} = \omega\left[\tfrac{1}{4}\left(u_{i+1,j}^{(n)} + u_{i-1,j}^{(n+1)} + u_{i,j+1}^{(n)} + u_{i,j-1}^{(n+1)}\right)\right] - (\omega - 1) u_{i,j}^{(n)}.$$

Here the ordering is $(1, 1), (2, 1), \ldots, (M - 1, 1), (1, 2), \ldots, (M - 1, 2)$,
etc. The optimum value of $\omega$ is given by

$$\omega_b = 1 + \left(\frac{\bar{\mu}}{1 + \sqrt{1 - \bar{\mu}^2}}\right)^2.$$

If $a = b = 1$ and $h$ is small, we have

$$\bar{\mu} = \cos \pi h \sim 1 - \frac{\pi^2 h^2}{2}$$

and, by (11.63),

$$R(L_1) = 2R(B) \sim \pi^2 h^2.$$

By (11.64) we have

$$R(L_{\omega_b}) \sim 2\sqrt{R(L_1)} \sim 2\pi h.$$

Thus the successive-overrelaxation method converges approximately
$(2/\pi h)$ times as fast as the Gauss-Seidel method.
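
These asymptotic statements can be compared with the exact rates for a particular mesh. The sketch below is an illustration, not part of the original text: the function name `rates` is an assumption, natural logarithms are used in (11.61), and the unit square ($a = b = 1$) is assumed so that $\bar{\mu} = \cos \pi h$.

```python
import math


def rates(h):
    """Return (R(B), R(L_1), omega_b, R(L_omega_b)) for the unit square
    with mesh size h, using (11.58'), (11.59), (11.61), and (11.62)."""
    mu = math.cos(math.pi * h)                        # spectral radius of B
    r_jacobi = -math.log(mu)                          # R(B)
    r_gauss_seidel = -math.log(mu ** 2)               # R(L_1), by (11.62)
    omega_b = 2.0 / (1.0 + math.sqrt(1.0 - mu ** 2))  # (11.58')
    r_sor = -math.log(omega_b - 1.0)                  # by (11.59)
    return r_jacobi, r_gauss_seidel, omega_b, r_sor


if __name__ == "__main__":
    h = 1.0 / 20.0
    r_b, r_1, omega_b, r_opt = rates(h)
    print("R(B)        =", r_b)
    print("R(L_1)      =", r_1, " pi^2 h^2 =", (math.pi * h) ** 2)
    print("omega_b     =", omega_b)
    print("R(L_omega_b)=", r_opt, " 2 pi h   =", 2.0 * math.pi * h)
    print("gain        =", r_opt / r_1, " 2/(pi h) =", 2.0 / (math.pi * h))
```

For $h = 1/20$ the computed $\omega_b$ agrees with the value 1.73 quoted in Sec. 11.5.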


11.9 Extensions of the Generality of the Successive- 
overrelaxation Method 


The question naturally arises whether the results obtained for the 
successive-overrelaxation method might hold under somewhat weaker 
assumptions than were made in Sec. 11.8. That the results do not hold 
in general is illustrated in [61], where a problem involving a positive 
definite matrix is considered and it is shown that the successive-over- 
relaxation method is much less effective than would be the case if the 
overrelaxation theory did apply. On the other hand, numerical experi- 
ments described in [57] indicate that the successive-overrelaxation 
method is about as effective when applied to the usual 9-point finite- 
difference equation corresponding to Laplace’s equation as when applied 
to the 5-point equation. Since the matrix corresponding to the 9-point 
equation, 


$$4[u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h)] + u(x + h, y + h) + u(x - h, y + h) + u(x + h, y - h) + u(x - h, y - h) - 20u(x, y) = 0, \tag{11.67}$$


does not have property A, it appeared that at least some weakening of the 
assumptions would be possible. 

Garabedian [19] has developed a method for obtaining estimates for 
the rate of convergence of the successive-overrelaxation method as ap- 
plied to the 5-point and to the 9-point difference-equation analogues of 
the Dirichlet problem for the rectangle. For the 5-point formula the 
result agrees with that obtained by the use of the overrelaxation theory. 
Indeed, for the unit square the formula for $\omega$ obtained from Garabedian's
method is

$$\omega = \frac{2}{1 + \pi h},$$

whereas using (11.40), with $\lambda = \cos^2 \pi h$, gives

$$\omega = 1 + \left(\frac{\cos \pi h}{1 + \sin \pi h}\right)^2 \sim \frac{2}{1 + \pi h}.$$

On the other hand, for the 9-point formula we obtain, by Garabedian's
method,

$$\omega = \frac{2}{1 + (\sqrt{26}/5)\pi h} \sim \frac{2}{1 + 1.02\pi h},$$

which is slightly less than for the 5-point formula. This result, which
does not appear to have been obtained by any other means, agrees with
the results of [57].
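
The three expressions for $\omega$ are easily compared for a given mesh size. The following sketch is illustrative only: the function names are hypothetical, and the routines simply evaluate the formulas quoted above.

```python
import math


def omega_five_point_theory(h):
    """(11.40) with lambda = cos^2(pi h); equals 2 / (1 + sin(pi h))."""
    return 1.0 + (math.cos(math.pi * h) / (1.0 + math.sin(math.pi * h))) ** 2


def omega_five_point_garabedian(h):
    """Garabedian's estimate for the 5-point formula."""
    return 2.0 / (1.0 + math.pi * h)


def omega_nine_point_garabedian(h):
    """Garabedian's estimate for the 9-point formula."""
    return 2.0 / (1.0 + (math.sqrt(26.0) / 5.0) * math.pi * h)


if __name__ == "__main__":
    for M in (10, 20, 40):
        h = 1.0 / M
        print(M, omega_five_point_theory(h),
              omega_five_point_garabedian(h),
              omega_nine_point_garabedian(h))
```

The 9-point value stays below the 5-point value for every $h > 0$, since $\sqrt{26}/5 > 1$.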

The method used by Garabedian involves relating the iterative for- 
mula for the successive-overrelaxation method to a certain hyperbolic 
partial differential equation and studying the latter equation. The 
technique is illustrated by a simple one-dimensional example in Sec. 
11.18. 

Kahan [27] has investigated the rate of convergence of the successive-
overrelaxation method for cases where the matrix $A$ does not have
property A. He assumes instead that the diagonal elements of $A$ are positive
and that the other elements are nonpositive. He proves a number of
interesting results without making further assumptions on $A$. However,
probably the most interesting and useful results apply for the case where
$A$ is symmetric and positive definite. Kahan shows that there exists $\omega_1$
such that, asymptotically for small values of $R(L_1)$,

$$\sqrt{R(L_1)} \le R(L_{\omega_1}) \le 2\sqrt{R(L_1)}.$$

Here $R(L_1)$ and $R(L_{\omega_1})$ are the rates of convergence, as defined in Sec.
11.8, for the Gauss-Seidel method and for the successive-overrelaxation
method, with $\omega = \omega_1$, respectively. Thus, as when property A holds
[in which case $R(L_{\omega_b}) \sim 2\sqrt{R(L_1)}$], the successive-overrelaxation
method converges much more rapidly than the Gauss-Seidel method.
The best value of $\omega$ does not differ very much from

$$1 + \frac{\lambda(L_1)}{(1 + \sqrt{1 - \lambda(L_1)})^2},$$

where, as before, we let $\lambda(L_1)$ denote the spectral radius of $L_1$. Kahan
gives suggestions for a more effective determination of the best value of $\omega$.

11.10 Block Iteration Methods 


In point iteration methods one essentially solves each equation for a 
certain unknown, using approximate values of the other unknowns and 
possibly also using overrelaxation. Natural extensions of these methods 
are the block iteration methods which are discussed in [3] and [17]. 
Here one divides the equations into a set of groups. The basic step in the
block iterative process is to solve the subsystem of equations belonging to a
given group for the corresponding unknowns, using approximate values
for the other unknowns. For instance, suppose we consider the linear
system

$$\sum_{j=1}^{m} a_{i,j}u_j + d_i = 0, \qquad i = 1, 2, \ldots, m, \tag{11.68a}$$

or

$$Au + d = 0, \tag{11.68b}$$

and assume that the equations and unknowns are partitioned into $q$
groups such that $u_1, u_2, \ldots, u_{m_1}$ belong to the first group,
$u_{m_1+1}, u_{m_1+2}, \ldots, u_{m_2}$ belong to the second group, and, in general,
$u_t$, $m_{k-1} < t \le m_k$, belong to the $k$th group. (Here for convenience
we let $m_0 = 0$.) Evidently, we may subdivide A into blocks

$$A = \begin{pmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,q} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,q} \\
\vdots & \vdots & & \vdots \\
A_{q,1} & A_{q,2} & \cdots & A_{q,q}
\end{pmatrix}, \tag{11.69}$$

where $A_{i,j}$ is an $(m_i - m_{i-1}) \times (m_j - m_{j-1})$ rectangular matrix.
Simultaneous block iteration may be defined by the following equations:

$$Du^{(n+1)} + Cu^{(n)} + d = 0 \tag{11.70a}$$

or

$$u^{(n+1)} = Bu^{(n)} + c, \tag{11.70b}$$

where

$$B = -D^{-1}C \tag{11.71}$$

and

$$c = -D^{-1}d. \tag{11.72}$$

Here A = D + C and

$$D = \begin{pmatrix}
A_{1,1} & O & \cdots & O \\
O & A_{2,2} & \cdots & O \\
\vdots & \vdots & \ddots & \vdots \\
O & O & \cdots & A_{q,q}
\end{pmatrix}. \tag{11.73}$$


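The method of successive block overrelaxation is defined by

$$u^{(n+1)} = \omega\bigl[Lu^{(n+1)} + Uu^{(n)} + c\bigr] - (\omega - 1)u^{(n)} \tag{11.74a}$$

or

$$u^{(n+1)} = L_\omega u^{(n)} + f, \tag{11.74b}$$

where

$$L_\omega = (I - \omega L)^{-1}\bigl[\omega U - (\omega - 1)I\bigr] \tag{11.75}$$

and

$$f = \omega(I - \omega L)^{-1}c. \tag{11.76}$$

Here B = L + U, and L and U have no nonzero elements above and
below the main diagonal, respectively.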
In order to relate the eigenvalues of B to those of $L_\omega$, we introduce,
following [3], the concept of block property A. A matrix A which has
been partitioned in the form (11.69) has block property A if by a suitable
rearrangement of the rows and corresponding columns it can be put into
the form

$$A = \begin{pmatrix} D_1 & F \\ G & D_2 \end{pmatrix}, \tag{11.77}$$

where $D_1$ and $D_2$ are matrices with matrix elements whose diagonal
elements are nonsingular square matrices and whose other elements are
null matrices. F and G are rectangular matrices whose elements are
rectangular matrices.

An ordering of the rows and corresponding columns of blocks of A is
consistent if A has the form

$$A = \begin{pmatrix}
D_1 & F_1 & O & \cdots & O & O \\
G_1 & D_2 & F_2 & \cdots & O & O \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
O & O & O & \cdots & D_{s-1} & F_{s-1} \\
O & O & O & \cdots & G_{s-1} & D_s
\end{pmatrix}, \tag{11.78}$$

where the $D_i$ are matrices with matrix elements with square matrices on the
main diagonal and null matrices elsewhere and the $F_i$ and $G_i$ are
rectangular arrays of matrices. Again, as in Sec. 11.8, we allow the
permutation of rows and columns of A provided that if $A_{i,j} \ne 0$ the ordering
relation between the $i$th row and the $j$th row is preserved.

It is easy to show that the matrix B defined by (11.71) has (point)
property A. Since $L_\omega$ bears exactly the same relation to B here as was
the case with point relaxation, all the results given in Sec. 11.8 hold.
Thus, for instance, if $\bar\mu = \lambda(B)$, the optimum value of $\omega$ is given by

$$\omega_b = \frac{2}{1 + \sqrt{1 - \bar\mu^2}}, \tag{11.79}$$

and as $\lambda(B)$ tends to one, we have

$$R(L_{\omega_b}) \sim 2\sqrt{R(L_1)} = 2\sqrt{2}\,\sqrt{R(B)}. \tag{11.80}$$


To illustrate the foregoing, let us consider the application of line
iteration to the example which was treated in Sec. 11.8. We consider blocks
each consisting of all the points on a row. Methods based on the use of
such blocks are called line iteration methods and have been discussed in Sec.
11.6.
For simultaneous line iteration, we have, by (11.65) and (11.42),

$$u_{i,j}^{(n+1)} = \tfrac14\bigl(u_{i+1,j}^{(n+1)} + u_{i-1,j}^{(n+1)} + u_{i,j+1}^{(n)} + u_{i,j-1}^{(n)}\bigr). \tag{11.81}$$

Even without solving the above equations we can find the eigenvalues of
the matrix B associated with the method of simultaneous line iteration.
Letting $\epsilon_{i,j}^{(n)} = u_{i,j}^{(n)} - u_{i,j}$, where $u_{i,j}$ is the true solution of the Laplace
difference equation, we observe that $\epsilon_{i,j}^{(n+1)}$ and $\epsilon_{i,j}^{(n)}$ satisfy (11.81) and
vanish on the boundary. We write (11.81) in the form

$$\epsilon^{(n+1)} = B\epsilon^{(n)}, \tag{11.82}$$



where $\epsilon^{(n)}$ is a column matrix with elements $\epsilon_{i,j}^{(n)}$. Evidently, for $p$, $q$
integers and for

$$v_{i,j} = \sin\frac{p\pi ih}{a}\,\sin\frac{q\pi jh}{b}, \tag{11.83}$$

we have

$$Bv = \mu v, \tag{11.84}$$

where

$$\mu = \frac{\cos(q\pi h/b)}{2 - \cos(p\pi h/a)}. \tag{11.85}$$

If $a = b = 1$, we have, for the spectral radius of B,

$$\bar\mu = \frac{\cos \pi h}{2 - \cos \pi h} \sim 1 - \pi^2h^2, \tag{11.86}$$

and, for successive line relaxation,

$$\lambda(L_1) = \left(\frac{\cos \pi h}{2 - \cos \pi h}\right)^2 \sim 1 - 2\pi^2h^2, \tag{11.87}$$

so that

$$R(L_1) = 2R(B) \sim 2\pi^2h^2,$$

which is twice as large as for the method of successive point iteration, that
is, the Gauss-Seidel method.
By (11.80) we have, for the method of successive line overrelaxation,

$$R(L_{\omega_b}) \sim 2\sqrt{R(L_1)} \sim 2\sqrt{2}\,\pi h.$$

Thus, as stated in Sec. 11.6, the rate of convergence of the method of
successive line overrelaxation is approximately $\sqrt{2}$ times that of
successive point overrelaxation.
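
The two ratios quoted above can be checked numerically. The following Python sketch evaluates the spectral radii for the unit square from (11.86) and from the point value $\cos \pi h$, and compares the resulting rates. The function names are illustrative only, and the rate of convergence is taken to be $-\log$ of the spectral radius, which is the definition the relation $R(L_1) = 2R(B)$ presupposes.

```python
import math

def sor_rate(mu):
    """Rate -log(omega_b - 1) of optimal SOR when the Jacobi-type matrix B
    has spectral radius mu and property A (or block property A) holds."""
    omega_b = 2.0 / (1.0 + math.sqrt(1.0 - mu * mu))
    return -math.log(omega_b - 1.0)

def compare_line_and_point(h):
    """Return (Gauss-Seidel ratio, SOR ratio) of line rates to point rates
    for the unit square with mesh size h."""
    mu_point = math.cos(math.pi * h)          # point Jacobi spectral radius
    mu_line = mu_point / (2.0 - mu_point)     # line Jacobi, Eq. (11.86)
    # R(L_1) = -2 log(mu) in both cases, so the factor 2 cancels.
    gs_ratio = math.log(mu_line) / math.log(mu_point)
    sor_ratio = sor_rate(mu_line) / sor_rate(mu_point)
    return gs_ratio, sor_ratio

for h in (0.1, 0.05, 0.01):
    gs_ratio, sor_ratio = compare_line_and_point(h)
    print(f"h = {h}: Gauss-Seidel ratio {gs_ratio:.4f}, SOR ratio {sor_ratio:.4f}")
```

As h decreases, the first ratio approaches 2 and the second approaches $\sqrt{2}$, in agreement with the asymptotic statements.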


11.11 p-cyclic Matrices 


It was pointed out by Birkhoff and Varga [6, 44] that matrices with
property A belong to a larger class of p-cyclic matrices essentially
considered by Frobenius [18] and Romanovsky [34]. A matrix A is said to
be p-cyclic ($p \ge 2$) if, by a permutation of the rows and of the
corresponding columns of A, it can be placed in the form

$$A = \begin{pmatrix}
D_1 & L_1 & O & \cdots & O \\
O & D_2 & L_2 & \cdots & O \\
\vdots & \vdots & \vdots & & \vdots \\
O & O & O & \cdots & L_{p-1} \\
L_p & O & O & \cdots & D_p
\end{pmatrix}, \tag{11.88}$$

where the submatrices $L_i$ are rectangular matrices and where the
diagonal submatrices $D_i$ are square diagonal matrices.* Clearly, matrices
with property A are 2-cyclic. As shown in [44], many results analogous
to those discussed in Sec. 11.8 can be proved. For instance, if $\mu$ is an
eigenvalue of B, then so is $\mu \exp(2\pi ik/p)$, $k = 1, 2, \ldots, p - 1$. The
relation between the eigenvalues $\lambda$ of $L_\omega$ and those of B is

$$(\lambda + \omega - 1)^p = \lambda^{p-1}\omega^p\mu^p,$$

from which it follows, for instance, that

$$R(L_1) = pR(B),$$

that is, the rate of convergence of the Gauss-Seidel method is p times that
of the Jacobi method.
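
The step from the eigenvalue relation to the statement about rates can be written out as follows. The sketch restricts attention to nonzero eigenvalues of $L_1$ and assumes, as elsewhere in this chapter, that the rate of convergence is $-\log$ of the spectral radius.

$$\omega = 1: \quad \lambda^p = \lambda^{p-1}\mu^p \;\Longrightarrow\; \lambda = \mu^p \quad (\lambda \ne 0),$$

$$\lambda(L_1) = [\lambda(B)]^p, \qquad R(L_1) = -\log \lambda(L_1) = -p\log \lambda(B) = pR(B).$$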


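If all eigenvalues of $B^p$ are real and nonnegative and if $0 < \lambda(B) < 1$,
then the optimum relaxation factor $\omega_b$ is given by

$$[\omega_b\lambda(B)]^p = \frac{p^p}{(p - 1)^{p-1}}(\omega_b - 1).$$

Evidently, when $p = 2$, we get $4(\omega_b - 1) = \omega_b^2[\lambda(B)]^2$, which is
equivalent to (11.58). Moreover,

$$\lambda(L_{\omega_b}) = (\omega_b - 1)(p - 1),$$

and for $2 > \omega > \omega_b$ and $p > 2$,

$$\lambda(L_{\omega_b}) < \lambda(L_\omega) < (\omega - 1)(p - 1).$$

Asymptotically as $\lambda(B)$ tends to one, we have

$$R(L_{\omega_b}) \sim \left(\frac{2p}{p - 1}\right)^{1/2}[R(L_1)]^{1/2}.$$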


* Actually, according to the original definition of a cyclic matrix, the $D_i$ should be
null matrices. However, if the diagonal elements of A are positive, the B matrix
corresponding to A will be cyclic in the strict sense.

Thus the order-of-magnitude improvement which was seen to hold for
p = 2 also holds for larger p.

An example of a problem with p = 3 is the use of a difference equation
with a triangular net, shown in Fig. 11.4, where the difference equations
for the points labeled R involve values of u at points labeled G,
where the difference equations for points labeled G involve values of u
at points labeled B, and where the difference equations for points labeled
B involve only values of u at points labeled R.

[Fig. 11.4. Triangular net with points labeled R, G, and B.]
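
For $p > 2$ the equation for the optimum relaxation factor has no simple closed-form solution, but it is easily solved numerically. The following Python sketch uses bisection on the interval $1 \le \omega \le p/(p-1)$, on which the equation has exactly one root when $0 \le \lambda(B) < 1$. The function names and the choice of bisection are illustrative assumptions, not part of [44].

```python
import math

def optimum_omega(mu, p, tol=1e-14):
    """Solve (mu*omega)**p = p**p * (p-1)**(1-p) * (omega - 1) for the root
    in [1, p/(p-1)] by bisection; mu is the spectral radius of B."""
    if not 0.0 <= mu < 1.0:
        raise ValueError("need 0 <= mu < 1")
    coeff = p ** p * (p - 1.0) ** (1 - p)

    def g(omega):
        return (mu * omega) ** p - coeff * (omega - 1.0)

    lo, hi = 1.0, p / (p - 1.0)      # g(lo) >= 0 and g(hi) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimum_spectral_radius(mu, p):
    """Spectral radius (omega_b - 1)(p - 1) of L at the optimum factor."""
    return (optimum_omega(mu, p) - 1.0) * (p - 1)

for p in (2, 3, 4):
    print(p, optimum_omega(0.9, p), optimum_spectral_radius(0.9, p))
```

For $p = 2$ the computed value agrees with $2/(1 + \sqrt{1 - \mu^2})$, which provides a check on the routine.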


11.12 Generalized Alternating-direction Implicit Method 


The alternating-direction implicit method of Peaceman and Rachford 
can be generalized to include somewhat more general linear systems. 
The discussion here is based on [45, 5, 46]. 

Let us consider the linear system

$$(H + V)u + d = 0, \tag{11.89}$$

where H and V are symmetric positive definite square matrices and where
u and d are unknown and known column matrices, respectively. The
basic assumption is made that H and V commute, that is, that

$$HV = VH. \tag{11.90}$$

The alternating-direction method of Peaceman and Rachford can be
generalized as follows:

$$\begin{aligned}
u^{(n+1/2)} &= u^{(n)} - r\bigl(Hu^{(n+1/2)} + Vu^{(n)} + d\bigr), \\
u^{(n+1)} &= u^{(n+1/2)} - r\bigl(Hu^{(n+1/2)} + Vu^{(n+1)} + d\bigr).
\end{aligned} \tag{11.91}$$

Here r is a parameter which may vary with n. Usually H and V will be
tridiagonal matrices, and the vectors $u^{(n+1/2)}$ and $u^{(n+1)}$ can be found by
solving successively the linear systems

$$\begin{aligned}
(I + rH)u^{(n+1/2)} &= (I - rV)u^{(n)} - rd, \\
(I + rV)u^{(n+1)} &= (I - rH)u^{(n+1/2)} - rd,
\end{aligned}$$

using the algorithm described in Sec. 11.6. Eliminating $u^{(n+1/2)}$ from
(11.91), we get

$$u^{(n+1)} = Q_ru^{(n)} + c, \tag{11.92}$$

where

$$\begin{aligned}
Q_r &= (I + rV)^{-1}(I - rH)(I + rH)^{-1}(I - rV), \\
c &= -\bigl[(I + rV)^{-1}(I - rH)(I + rH)^{-1} + (I + rV)^{-1}\bigr]rd.
\end{aligned} \tag{11.93}$$

Since H and V are symmetric positive definite matrices which commute, there
exists a common basis of eigenvectors for both H and V (see, e.g., [46],
Appendix A). Consequently, if $\lambda$ and $\mu$ are eigenvalues of H and V, respectively,
with common eigenvector v, then v is also an eigenvector of $(I + rV)^{-1}$,
$(I - rH)$, $(I + rH)^{-1}$, $(I - rV)$, and hence of $Q_r$. The corresponding
eigenvalue of $Q_r$ is

$$\frac{(1 - r\lambda)(1 - r\mu)}{(1 + r\lambda)(1 + r\mu)}. \tag{11.94}$$

Moreover, every eigenvalue of $Q_r$ is given by (11.94) for some eigenvalues
$\lambda$ and $\mu$ of H and V, respectively.
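
The eigenvalue formula (11.94) can be verified on a small case. The following Python sketch carries out one double sweep of (11.91) for the Laplace difference equation on the unit square, with zero boundary values and d = 0 (that is, it acts on the error), and compares the result with the predicted factor. Here H and V are taken to be the second-difference operators in the x and y directions; the routine names and the tridiagonal solver are illustrative and are not the algorithm of Sec. 11.6 verbatim.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve lower*x[i-1] + diag*x[i] + upper*x[i+1] = rhs[i] with
    constant bands by forward elimination and back substitution."""
    n = len(rhs)
    c = [0.0] * n
    g = [0.0] * n
    c[0] = upper / diag
    g[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - lower * c[i - 1]
        c[i] = upper / denom
        g[i] = (rhs[i] - lower * g[i - 1]) / denom
    x = [0.0] * n
    x[-1] = g[-1]
    for i in range(n - 2, -1, -1):
        x[i] = g[i] - c[i] * x[i + 1]
    return x

def double_sweep(u, r):
    """One Peaceman-Rachford double sweep for (H + V)u = 0 on an
    (M+1) x (M+1) array u[i][j] whose boundary values are zero."""
    M = len(u) - 1
    half = [[0.0] * (M + 1) for _ in range(M + 1)]
    for j in range(1, M):      # (I + rH) half = (I - rV) u, row by row
        rhs = [u[i][j] - r * (2 * u[i][j] - u[i][j + 1] - u[i][j - 1])
               for i in range(1, M)]
        row = solve_tridiagonal(-r, 1 + 2 * r, -r, rhs)
        for i in range(1, M):
            half[i][j] = row[i - 1]
    new = [[0.0] * (M + 1) for _ in range(M + 1)]
    for i in range(1, M):      # (I + rV) new = (I - rH) half, column by column
        rhs = [half[i][j] - r * (2 * half[i][j] - half[i + 1][j] - half[i - 1][j])
               for j in range(1, M)]
        col = solve_tridiagonal(-r, 1 + 2 * r, -r, rhs)
        for j in range(1, M):
            new[i][j] = col[j - 1]
    return new

def predicted_factor(r, lam, mu):
    """Eigenvalue of Q_r given by (11.94)."""
    return (1 - r * lam) * (1 - r * mu) / ((1 + r * lam) * (1 + r * mu))

def eigenvector(M, p, q):
    """Common eigenvector sin(p*pi*i*h) sin(q*pi*j*h) with zero boundary."""
    h = 1.0 / M
    v = [[0.0] * (M + 1) for _ in range(M + 1)]
    for i in range(1, M):
        for j in range(1, M):
            v[i][j] = math.sin(p * math.pi * i * h) * math.sin(q * math.pi * j * h)
    return v
```

Applying `double_sweep` to `eigenvector(M, p, q)` should multiply every interior value by `predicted_factor(r, lam, mu)` with $\lambda = 4\sin^2(p\pi h/2)$ and $\mu = 4\sin^2(q\pi h/2)$.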
The successive use of $r_1, r_2, \ldots, r_t$ leads to eigenvalues

$$\prod_{k=1}^{t}\frac{(1 - r_k\lambda)(1 - r_k\mu)}{(1 + r_k\lambda)(1 + r_k\mu)}$$

of the matrix $\prod_{k=1}^{t} Q_{r_k}$. For a given t, it is desired to choose the $r_k$ so that
we can minimize the maximum absolute value of this expression, the
maximum being taken over all eigenvalues $\lambda$ and $\mu$ of H and V,
respectively. The method suggested in [45] and [46] for choosing the $r_k$ is
according to the following formula:

$$r_k = \frac{1}{b\,x^{k-1}}, \qquad k = 1, 2, \ldots, t,$$

where

$$x = \left(\frac{a}{b}\right)^{1/(t-1)}.$$

Here all values of $\lambda$ and $\mu$ are assumed to lie in the range $0 < a \le \lambda, \mu \le b$.
An estimate for the factor of reduction after the t double sweeps is

$$P_t = \left(\frac{1 - x^{1/2}}{1 + x^{1/2}}\right)^4.$$

As remarked in Sec. 11.7, one should choose the integer t so that $S_t =
(P_t)^{1/t}$ is minimized, though this choice is subject, of course, to the
restriction that $P_t$ should not be too small. Thus, if we desire to reduce the
initial error to a fraction $\rho$ of its initial value, then $P_t$ should not be less
than $\rho$. On the other hand, it may prove to be advantageous to choose
t such that $P_t > \rho$. In this case, several sets of t double sweeps each
would be required in order to achieve the desired convergence.
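
The parameter formula and the estimate $P_t$ are simple to evaluate. The following Python sketch computes the $r_k$, the estimate $P_t$, and the value of t that minimizes $S_t$ over a small range. The function names and the search limit `t_max` are assumptions made for the illustration, and the restriction involving $\rho$ is not applied.

```python
import math

def wachspress_parameters(a, b, t):
    """Return ([r_1, ..., r_t], x) with r_k = 1/(b*x**(k-1)) and
    x = (a/b)**(1/(t-1)); the formula for x requires t >= 2."""
    if t < 2:
        raise ValueError("t must be at least 2")
    x = (a / b) ** (1.0 / (t - 1))
    return [1.0 / (b * x ** (k - 1)) for k in range(1, t + 1)], x

def reduction_estimate(x):
    """Estimated factor of reduction P_t after t double sweeps."""
    s = math.sqrt(x)
    return ((1.0 - s) / (1.0 + s)) ** 4

def best_t(a, b, t_max=20):
    """Return (t, S_t) minimizing S_t = P_t**(1/t) for 2 <= t <= t_max."""
    best = None
    for t in range(2, t_max + 1):
        _, x = wachspress_parameters(a, b, t)
        s_t = reduction_estimate(x) ** (1.0 / t)
        if best is None or s_t < best[1]:
            best = (t, s_t)
    return best

# Unit square with mesh size h: bounds on the eigenvalues of H and V.
h = 1.0 / 20
a = 4.0 * math.sin(math.pi * h / 2.0) ** 2
b = 4.0 * math.cos(math.pi * h / 2.0) ** 2
print(best_t(a, b))
```

The first parameter equals $1/b$ and the last equals $1/a$, so the two extreme eigenvalues are annihilated exactly by one of the t sweeps.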

The analysis of the example considered in Sec. 11.7 is included in the
foregoing discussion. The matrices H and V correspond to the difference
operators

$$2u(x,y) - [u(x + h, y) + u(x - h, y)]$$

and

$$2u(x,y) - [u(x, y + h) + u(x, y - h)],$$

respectively. They are easily seen to be symmetric and positive definite
and, for the case of a rectangular region, they commute with each other.
Clearly the sum of these two difference operators is the negative of the
Laplace difference operator. Finally, we remark that for the unit
square with mesh size h, the eigenvalues of both H and V lie in the range

$$4\sin^2\frac{\pi h}{2} \le \lambda, \mu \le 4\cos^2\frac{\pi h}{2}.$$


The basic theory underlying the Peaceman-Rachford method cannot
be extended to include all self-adjoint elliptic equations or to apply to
nonrectangular regions, as is possible for the successive-overrelaxation
method.* In fact, Birkhoff and Varga [5] have shown that for Laplace’s
equation the condition (11.90) can hold only for a rectangle. Although
the alternating-direction method may well be very useful in more
general cases than those involving Laplace’s equation in the rectangle,
it would appear that either more theoretical work or a great deal of
successful computational experience is needed before the method can be
recommended without reservation for general use.

Even in the case of Laplace’s difference equation for the region shown in
Fig. 11.5, which contains only 5 net points, the matrix $Q_r$, with r = 1, has
complex eigenvalues with relatively large imaginary parts. If the theory
which holds for the rectangle could be extended to more general regions,
then, by (11.94), no complex eigenvalues could occur for real r. The
five eigenvalues of $Q_r$ are

$$\tfrac19,\quad \tfrac19,\quad 0.2164,\quad -0.05039 + 0.08909i,\quad -0.05039 - 0.08909i.$$

[Fig. 11.5. Region containing 5 net points.]

* However, Wachspress, in a private communication, pointed out that, by using
generalized conditions on H and V as formulated in [45], one can treat the case where
the mesh spacings in both coordinate directions are not uniform. To satisfy the
generalized conditions, neither H nor V need be symmetric as long as there exists a
positive definite matrix F such that HF and VF are positive definite. It can be shown
that, using the generalized conditions, one can apply the theory to problems involving
the partial differential equations


F ou a . Ou 
5 FICO? 4 + 5] HoRt :| = S34) 


and the rectangle.

One important positive result concerning the Peaceman-Rachford
method should be mentioned. It has been proved in [46] that for any 
positive r the spectral radius of Q, is less than one. It is assumed that H 
and V are symmetric and positive definite, but they need not commute 
with each other. 


11.13 Applications and Numerical Experiences 


In this section some experiences derived from solving certain problems 
on high-speed computers are described. Some problems were designed 
primarily to test the numerical methods, whereas others were actual 
practical problems. 

As described in [57], a program was prepared for the ORDVAC
computer at the Aberdeen Proving Ground, Maryland, for solving, for a
class of regions, either the 5-point or the 9-point finite-difference analogue
of the Dirichlet problem. The main purpose in preparing this program,
other than to study the accuracy of the finite-difference methods, as
described in Sec. 11.4, was to evaluate the successive-overrelaxation
method. A number of Dirichlet problems were solved for different
regions and using different values of $\omega$. As many as $19^2 = 361$ interior
net points could be handled. The behavior of the successive-overrelaxation
method was very much as expected from the theory. The
use of the best relaxation factor reduced the number of iterations
necessary for convergence by a factor which was as large as 10 in some
cases. For the square, where the theoretical optimum $\omega$ is known
exactly, the observed best $\omega$ was found to agree very closely. It was
also observed that, in accordance with the theory, the effect of using a
value of $\omega$ slightly greater than optimum was much less serious than
the effect of using one which was too small. It was also observed that,
if a certain value of $\omega$ was used, the number of iterations required was
the same for all problems for which the optimum value was less than the
value used. This also agrees with the theory. It was concluded that,
if only a few problems are to be solved for a given region, then it is best
to base the choice of $\omega$ on the computed value for an appropriate rectangle
of the same area. On the other hand, if a great many problems are
to be solved, then it may pay to perform a number of iterations with
$\omega = 1$ and to estimate $\lambda(L_1)$, from which $\omega$ can be computed by means
of (11.40). The variation of the number of iterations with $\omega$ for the
9-point formula was approximately the same as for the 5-point formula.
As expected (see Sec. 11.9), the optimum $\omega$ for the 9-point formula was
slightly smaller than for the 5-point formula.
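
The procedure of iterating with $\omega = 1$ to estimate $\lambda(L_1)$ can be sketched as follows. The Python code below is an illustration under stated assumptions, not the ORDVAC program of [57]: it treats the 5-point formula on the unit square with zero boundary values, estimates $\lambda(L_1)$ from the ratio of successive changes, and assumes that (11.40), which is not reproduced here, is equivalent under property A to $\omega_b = 2/(1 + \sqrt{1 - \lambda(L_1)})$.

```python
import math

def estimate_gauss_seidel_radius(M, sweeps=100):
    """Run Gauss-Seidel sweeps (omega = 1) for the 5-point Laplace
    difference equation with zero boundary values on the unit square
    (mesh size 1/M) and estimate lambda(L_1) as the ratio of the
    Euclidean norms of two successive changes."""
    u = [[0.0] * (M + 1) for _ in range(M + 1)]
    for i in range(1, M):
        for j in range(1, M):
            u[i][j] = 1.0                 # arbitrary starting guess
    previous = None
    ratio = None
    for _ in range(sweeps):
        total = 0.0
        for i in range(1, M):
            for j in range(1, M):
                new = 0.25 * (u[i + 1][j] + u[i - 1][j]
                              + u[i][j + 1] + u[i][j - 1])
                total += (new - u[i][j]) ** 2
                u[i][j] = new
        norm = math.sqrt(total)
        if previous:
            ratio = norm / previous
        previous = norm
    return ratio

def omega_from_radius(lam):
    """Relaxation factor from an estimate of lambda(L_1), assuming
    property A so that lambda(L_1) is the square of lambda(B)."""
    return 2.0 / (1.0 + math.sqrt(1.0 - lam))

lam = estimate_gauss_seidel_radius(10)
print(lam, omega_from_radius(lam))
```

For the square the estimate can be compared with the known values $\lambda(L_1) = \cos^2 \pi h$ and $\omega_b = 2/(1 + \sin \pi h)$.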

A similar set of numerical experiments was performed in connection
with the equation $U_{xx} + (k/y)U_y + U_{yy} = 0$ for various values of the
constant k, as described in [59]. Boundary-value problems for the
region $0 < A \le y \le A + B$, $C \le x \le D$ were treated. It was found
that the methods developed by Warlick [50] for estimating the optimum
value of $\omega$ were very satisfactory. On the other hand, values derived by
assuming k = 0 gave results which were almost as good. It was
observed that, although the use of the equation in the form given above
rather than in the self-adjoint form $(y^kU_x)_x + (y^kU_y)_y = 0$ leads to
equations which are not symmetric, the rates of convergence were
almost the same as when the self-adjoint form of the equation was
used.*
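
That the two forms of the differential equation agree for $y > 0$, which holds in the region considered since $A > 0$, can be seen by carrying out the differentiation:

$$(y^kU_x)_x + (y^kU_y)_y = y^kU_{xx} + y^kU_{yy} + ky^{k-1}U_y = y^k\Bigl[U_{xx} + \frac{k}{y}U_y + U_{yy}\Bigr].$$

Only the difference equations derived from the two forms differ.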

Among the important applications of elliptic partial differential
equations are problems in weather prediction, nuclear-reactor physics,
and flow studies. Charney and Phillips describe in [8] the methods
used at the Institute for Advanced Study with the IAS computer for
weather prediction. Although the problem is, of course,
three-dimensional and time-dependent, a basic step in the numerical process is
the solution of a two-dimensional problem. The method of successive
overrelaxation was used, with apparently satisfactory results.

As described in [58] and [4], an approximate solution was obtained
to the problem of determining the axially symmetric flow past a pair of
disks between which there is assumed to be a cavity, as shown in Fig. 11.6.
The problem is unusually difficult because the boundary of the cavity is not
known. It is necessary to assume a trial boundary, to solve an elliptic
boundary-value problem, to test certain auxiliary boundary conditions,
and, if these are not satisfied, to modify the boundary and repeat the
process.

[Fig. 11.6. Axially symmetric flow past a pair of disks with a cavity between them.]

* An explanation is given in [50].

Since it is clearly necessary to solve many boundary-value problems
before one can even hope to find the proper boundary, the rate of
convergence of the iterative method used to solve the difference equation
is of critical importance. This was particularly true in the computation
described, since there were more than 1000 mesh points.
Unfortunately, the use of the successive-overrelaxation method, although
considerably better than the Gauss-Seidel method, was not so effective
as was expected. Since the use of the estimated best value of $\omega$ led to a
divergent process, it was necessary to use a much smaller value, with an
accompanying loss in convergence speed. It is believed that this
phenomenon was caused by the use of special difference formulas for
points connecting regions of the fine and the coarse mesh. Different
mesh sizes were used, since near the edges of the disks a mesh size was
needed which, if used throughout the entire region, would have involved
too many mesh points.

Thus, suppose that the numbered points 1, 2, 4, 5 in Fig. 11.7 belong
to the coarse mesh and that the points a to i belong to the fine mesh.
The difference equations for points 2 and 5 involve certain adjacent
points of the coarse mesh, as well as points b and h, respectively. The
difference equations for b, c, e, f, h, and i involve points of the fine mesh.
The only questions arise in connection with points like a, d, and g.
Clearly, for a and g we can use adjacent points of the fine mesh and points
2 and 5, respectively. Now, in the work described in [58, 4] for point
d a special formula was used which involved points 2, b, 5, and h. The
use of this formula led to a system of linear equations with a matrix
which was not symmetric and which did not have property A. Although
these conditions are by no means necessary for convergence, the fact that
they were not satisfied here had a very unfavorable effect.

[Fig. 11.7. Junction of the coarse mesh (points 1, 2, 4, 5) and the fine mesh (points a to i), with the auxiliary point z between points 2 and 5.]

Two methods are suggested for improving this situation. In the 
first of these, one introduces the point z and, instead of trying to represent 
the difference equation, one simply uses linear interpolation; that is, 
one lets $u_z = \tfrac12(u_2 + u_5)$. Although the resulting linear system is not
symmetric and does not have property A, the convergence is reported by 
Dr. R. J. Arms to be much improved, at least for some problems. 

An alternative procedure would be to use a variational approach, 
already mentioned in Sec. 11.3. This would lead to a linear system 
with a symmetric and positive definite matrix. It would be necessary, 
however, to try to prevent the occurrence of negative coefficients. 
Kahan’s results would indicate that the presence of negative coefficients 
would be a much more serious disadvantage than the fact that the matrix 
would not in general have property A. Further studies along these 
lines are indicated.* 

*Note added in proof: Kahan (in a private communication) reports that in a
number of numerical experiments the occurrence of small negative coefficients did not
have a serious effect on the convergence. In every such case the matrix was
symmetric.

A great deal of work is currently being done on two- and
three-dimensional multigroup diffusion calculations arising in nuclear-reactor
studies (see, e.g., [39, 43, 46, 22]). Mathematically, a typical
problem consists of a system of n simultaneous elliptic differential
equations, each corresponding to one of n “groups” or neutron
energy levels.* It is desired to determine a single parameter $\lambda$,
sometimes called the criticality parameter, which appears in one of the
equations.

The usual procedure is to iterate with respect to the parameter $\lambda$.
Each such iteration is known as an “outer” iteration. Each outer
iteration consists of solving successively each of the n equations for the
corresponding function, assuming that the other (n - 1) functions are
known. The process of finding such a function is precisely that of
solving an elliptic partial differential equation in a two- (or three-)
dimensional region. Each iteration of the iterative process which is
used to do this is known as an “inner” iteration. Thus, each outer
iteration consists of n sets of inner iterations. Here more than ever,
because of the large number of elliptic boundary-value problems which
must be solved, the rate of convergence of the iterative procedure is
crucial.

Varga describes in [43] a program for the IBM 704 computer for
solving two-group problems. This program, known as the QED code,
has since been modified so as to handle four groups. The modified
program is known as the PDQ code [62]. With a computer having
16000 words of core storage, up to 6500 mesh points can be handled in a
rectangular region where, however, the coefficients of the differential
equation are allowed to be discontinuous across straight lines known as
interfaces. Across interfaces certain continuity conditions involving the
functions $u_i$ and their normal derivatives are imposed.

As described in [62], before beginning the iterative process, the best 
overrelaxation factor corresponding to each differential equation is
estimated. The largest eigenvalue of the matrix B (see Sec. 11.8) is 
found by an iteration scheme due to Hestenes and Karush [63]. Con- 
siderable success with the method has been obtained, with the result that 
the PDQ code is now regarded as among the best nuclear-reactor pro- 
grams presently available. 

Wachspress describes in [46] the CURE program for the IBM 704
computer which can be used to solve neutron diffusion problems
involving three groups and as many as 3000 mesh points. The
Peaceman-Rachford method of iteration is used. The convergence of the method
has been reported to be extremely satisfactory even for cases where the
present theory does not apply.


* It may also be considered as a single equation relating an elliptic differential
operator to a derivative with respect to a “timelike” variable (see, e.g., [39]). This
equation is known as the age diffusion equation.

PARABOLIC EQUATIONS

11.14 Introduction


We shall consider the class of parabolic partial differential equations

$$U_t = U_{xx} + \phi(x,t,U) \tag{11.95}$$

and

$$U_t = U_{xx} + U_{yy} + \psi(x,y,t,U), \tag{11.96}$$

where U is the dependent variable and $\phi = \phi(x,t,U)$ and $\psi = \psi(x,y,t,U)$
are analytic functions of their arguments. It is easy to show that the 
types of boundary conditions which can be imposed for parabolic 
equations and which lead to problems with unique solutions are quite 
different from those for elliptic equations. For instance, with (11.95) 
we cannot in general prescribe values of U over a closed curve in the
(x,t) plane. We can, however, impose the following boundary
conditions:

$$U(0,t) = g_1(t), \tag{11.97a}$$
$$U(1,t) = g_2(t), \tag{11.97b}$$
$$U(x,0) = f(x). \tag{11.97c}$$

The functions $g_1(t)$, $g_2(t)$, and $f(x)$ are given functions of their arguments
which are continuous except for a finite number of finite jumps. Here
we are considering the region K: $0 \le x \le 1$, $t \ge 0$. Since t frequently
refers to the time variable, we sometimes refer to (11.97c) as an initial
condition.

Whereas for elliptic equations the value of the solution at any one
point depends on all the boundary values, in problems defined by (11.95)
and (11.97) the value of $U(x,t_0)$ depends only on the values of $f(x)$ and
on the values of $g_1(t)$ and $g_2(t)$ for values of t in the range $0 \le t \le t_0$.
This has important implications for the determination of numerical
solutions by the use of finite-difference methods. In problems involving
parabolic equations one can construct numerical solutions step by step,
using a so-called “marching” process. For elliptic problems, on the
other hand, the solution cannot be found at any single point until it is
found at all points.

Although the foregoing would tend to indicate that it is much simpler 
to compute the numerical solution of parabolic problems than of elliptic 
problems, the use of marching procedures introduces a difficulty not 
present in elliptic problems, namely, the problem of stability.
Considerations of stability make it necessary to reject certain otherwise satisfactory
methods. However, as shown in the next two sections, the stability
problem can be overcome in an entirely satisfactory manner by the use
of so-called “implicit” methods.

In Sec. 11.17 we discuss the solution of problems involving two space
variables in addition to the time variable. The alternating-direction
implicit method of Douglas [11] appears to be among the most promising
for such problems. The close relation between the methods used to
solve these problems and certain iterative methods for solving elliptic
equations is brought out in Sec. 11.17. In Sec. 11.18 a procedure
developed by Garabedian [19] is used to study the rate of convergence
of the successive-overrelaxation method by considering a certain
hyperbolic differential equation.


11.15 The Forward Difference Method 


Although the discussion here and in the following section is limited to
problems in two variables, x and t, involving Eqs. (11.95) and (11.97),
much of what is said applies also to problems involving (11.96).

Perhaps the simplest finite-difference method for solving problems
defined by (11.95) and (11.97) is the forward difference method. First one
considers a network of points (x,t) such that x = ih, i = 0, 1, 2, ..., M,
where M is an integer and $h = M^{-1}$, and such that t = jk. Here h and
k represent the space and time increments, respectively. Let us consider
a typical 4-point configuration, as shown in Fig. 11.8.

We represent the space derivative by

$$(U_{xx})_0 \approx \frac{U_1 + U_3 - 2U_0}{h^2} \tag{11.98}$$

and the time derivative by a forward difference quotient

$$(U_t)_0 \approx \frac{U_2 - U_0}{k}. \tag{11.99}$$

Our notation here is the same as in Sec. 11.2.

[Fig. 11.8. The 4-point configuration: $(x_3,t_3)$ and $(x_1,t_1)$ at distance h on either side of the central point, and $(x_2,t_2)$ at distance k above it.]

To derive the forward difference equation, we substitute (11.98)
and (11.99) in (11.95), obtaining, upon replacement of $x_0$ by x and $t_0$
by t, etc.,

$$u(x, t + k) = r[u(x + h, t) + u(x - h, t)] + (1 - 2r)u(x,t) + rh^2\phi[x,t,u(x,t)], \tag{11.100}$$

where, for convenience, we let

$$r = \frac{k}{h^2}. \tag{11.101}$$

If we let $u_{i,j} = u(ih, jk)$, $\phi_{i,j} = \phi(ih, jk, u_{i,j})$, $f_i = f(ih)$, etc., then we have

$$u_{i,j+1} = r(u_{i+1,j} + u_{i-1,j}) + (1 - 2r)u_{i,j} + rh^2\phi_{i,j}, \qquad i = 1, 2, \ldots, M - 1;\ j = 0, 1, 2, \ldots. \tag{11.102}$$

Since $u_{i,0} = f_i$, $u_{0,j} = (g_1)_j$, $u_{M,j} = (g_2)_j$, one can readily compute
successively $u_{i,1}$ for each i, then $u_{i,2}$, etc.

In [25] it is shown that, for the case $\phi = 0$, $g_1 = g_2 = 0$, the solutions
of the difference equation converge to the solution of (11.95) and (11.97),
provided r is fixed and $r \le \tfrac12$. For the same case, Leutert [30] has
shown the existence of a set of solutions of the difference equation with
$r > \tfrac12$ which converge to the solution of (11.95) and (11.97). This set
of solutions was constructed, not to be used in practice, but simply to
demonstrate the distinction between convergence and stability.

The problem of stability and its relation to the problem of convergence
have been treated in a number of papers (e.g., [42, 29, 23, 31, 13]). It is
very difficult to define precisely what is meant by the term “stability.”
Roughly speaking, if a process is unstable and if an error, such as a
rounding error, is made at any stage of the computation, this error will
increase exponentially as j increases. It frequently happens that an
unstable process is not convergent, although Leutert’s example shows
that this is not always the case.

We shall try to illustrate some of the foregoing by an example. Let us
assume $g_1 = g_2 = 0$ and $\phi = 0$. Let $f(x) = 1$, $0 < x < 1$, $f(0) =
f(1) = 0$. Let $h = \tfrac14$ and $r = \tfrac12$. Our difference equation (11.102)
becomes

$$\begin{aligned}
u_{i,j+1} &= \tfrac12(u_{i+1,j} + u_{i-1,j}) && (i = 1, 2, 3;\ j = 0, 1, 2, \ldots), \\
u_{i,0} &= 1 && (i = 1, 2, 3), \\
u_{0,j} &= u_{4,j} = 0 && (j = 0, 1, 2, \ldots).
\end{aligned} \tag{11.103}$$

The numerical solution may be given by the following table. (The
quantities in parentheses should be temporarily ignored.)

 j \ i    0        1               2               3           4

   0      0      1.000           1.000           1.000         0
   1      0       .500           1.000            .500         0
   2      0       .500            .500 (+ε)       .500         0
   3      0       .250 (+ε/2)     .500            .250 (+ε/2)  0
   4      0       .250            .250 (+ε/2)     .250         0
   5      0       .125 (+ε/4)     .250            .125 (+ε/4)  0

It is not difficult to believe that, even though the values obtained for
any given i are not “smooth,” the numbers obtained by this process
would approach the exact solution of the differential equation as h and
k were decreased. To study the stability, let us introduce an error of ε
in $u_{2,2}$. Because the problem is linear, we can study the error separately,
as indicated in the above table. We note that in this case the error
decreases with j and the process is stable.
Let us next consider the case where r = 1. Equation (11.102)
becomes

$$u_{i,j+1} = u_{i+1,j} + u_{i-1,j} - u_{i,j},$$

and we obtain the following results:

 j \ i    0        1               2               3           4

   0      0        1               1               1           0
   1      0        0               1 (+ε)          0           0
   2      0        1 (+ε)         -1 (-ε)          1 (+ε)      0
   3      0       -2 (-2ε)         3 (+3ε)        -2 (-2ε)     0
   4      0        5 (+5ε)        -7 (-7ε)         5 (+5ε)     0
   5      0      -12 (-12ε)       17 (+17ε)      -12 (-12ε)    0

Since it is easy to show that in this problem all values of u lie between
zero and unity, it is clear that the above numbers are considerably in
error and that the error is increasing rapidly with j. We are not surprised
to learn that this process is not convergent. Moreover, a stability
analysis similar to the one performed previously for the case $r = \tfrac12$ indicates
that the process is not stable.
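
Both tables, and the propagation of the error ε, can be reproduced with a few lines of Python. The sketch below implements (11.102) for the case $\phi = 0$, $g_1 = g_2 = 0$; the function name is illustrative. Because the problem is linear, the error is followed by applying the same routine to a row that is zero except for a unit error.

```python
def forward_difference(f_interior, r, steps):
    """March u_{i,j+1} = r(u_{i+1,j} + u_{i-1,j}) + (1 - 2r)u_{i,j} with
    zero boundary values; return the rows j = 0, 1, ..., steps."""
    row = [0.0] + list(f_interior) + [0.0]
    rows = [row]
    for _ in range(steps):
        new = [0.0] * len(row)
        for i in range(1, len(row) - 1):
            new[i] = r * (row[i + 1] + row[i - 1]) + (1.0 - 2.0 * r) * row[i]
        rows.append(new)
        row = new
    return rows

# Solution tables for h = 1/4 with f = 1 at the three interior points.
for r in (0.5, 1.0):
    print("r =", r)
    for j, row in enumerate(forward_difference([1.0, 1.0, 1.0], r, 5)):
        print(j, row)

# Growth of a unit error introduced at the middle point.
print(forward_difference([0.0, 1.0, 0.0], 0.5, 3))   # decays: stable
print(forward_difference([0.0, 1.0, 0.0], 1.0, 4))   # grows: unstable
```

With $r = \tfrac12$ the unit error is halved every two steps, whereas with r = 1 it grows to 17 times its size in four steps, as in the second table.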

If the initial values were defined by $f(x) = \sin \pi x$, then the method
would be convergent, though not stable. Thus we would obtain the
following results:

 j \ i    0       1        2        3      4

   0      0     .707    1.000     .707     0
   1      0     .293     .414     .293     0
   2      0     .121     .172     .121     0
   3      0     .051     .070     .051     0
   4      0     .019     .032     .019     0
   5      0     .013     .006     .013     0
   6      0    -.007     .020    -.007     0


Here the values are “respectable” until we reach j = 5, when, because of
rounding errors, they begin to oscillate. If we had carried more decimal
places in our calculations, the occurrence of the oscillation would have
been delayed. Finally, if we had used exact values throughout, that is,
if we had used $\sqrt{2}/2$ instead of .707, etc., the oscillations would never
have occurred. Naturally, these latter considerations are largely
academic, and one must in practice use $r \le \tfrac12$ with the forward
difference method.
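
The effect of carrying only three decimal places can be seen by running the recurrence for r = 1 twice, once from the rounded starting values and once from double-precision values of $\sin(\pi i/4)$. In the Python sketch below the rounded values are held as integers in units of 0.001, so that the arithmetic of the table is reproduced exactly; the function name is illustrative.

```python
import math

def march_r1(row, steps):
    """Apply u_{i,j+1} = u_{i+1,j} + u_{i-1,j} - u_{i,j} (the case r = 1)
    to a row whose two end values are held at zero."""
    rows = [list(row)]
    for _ in range(steps):
        new = [0] * len(row)
        for i in range(1, len(row) - 1):
            new[i] = row[i + 1] + row[i - 1] - row[i]
        rows.append(new)
        row = new
    return rows

# Starting values rounded to three decimals, in units of 0.001.
rounded = march_r1([0, 707, 1000, 707, 0], 6)

# Starting values sin(pi*i/4) in double precision.
start = [0.0] + [math.sin(math.pi * i / 4) for i in (1, 2, 3)] + [0.0]
accurate = march_r1(start, 6)

for j in range(7):
    print(j, rounded[j][1:4], [round(v, 5) for v in accurate[j][1:4]])
```

The accurate values decrease by the factor $\sqrt{2} - 1$ at each step and remain positive at j = 6, while the three-decimal values have already changed sign. In floating-point arithmetic the oscillation is only delayed, not removed, since rounding errors are still amplified at every step.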


11.16 The Crank-Nicolson Method


The requirement that the ratio r (= k/h²) be fixed implies that, as h
tends to zero, k tends to zero very rapidly. This means that, if one is
interested in the value of u for a particular point (x,t) and if one halves h,
then the number of time steps must be increased by a factor of 4. The
amount of computation required would thus be increased by a factor of 8.

This difficulty may be overcome at a moderate cost by the use of the 
implicit difference equation used by Crank and Nicolson [10]. The 
difference equation is the same as the forward difference equation except 
that in the approximation to $U_{xx}$ one uses, not the difference quotient
involving values of u for the current value of t, but the average of the
difference quotients for the current and for the new value of t.


[Fig. 11.9: numbered mesh points $(x_i, t_i)$ used in the difference formula]


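Accordingly, we use the following formula for $(U_{xx})_0$, written here in terms of the point $(x_0, t_0)$:


$$(U_{xx})_0 \approx \frac{1}{2}\left[\frac{U(x_0+h,t_0) + U(x_0-h,t_0) - 2U(x_0,t_0)}{h^2} + \frac{U(x_0+h,t_0+k) + U(x_0-h,t_0+k) - 2U(x_0,t_0+k)}{h^2}\right]$$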


and obtain, upon substitution into (11.95) and replacement of U by u,


$$u_{i,j+1} = \frac{r}{2}\left(u_{i+1,j} + u_{i-1,j} + u_{i+1,j+1} + u_{i-1,j+1} - 2u_{i,j+1}\right) + (1 - r)\,u_{i,j} + rh^2\phi_{i,j},$$

$$i = 1, 2, \ldots, M-1; \quad j = 0, 1, 2, \ldots. \tag{11.104}$$


Since Eq. (11.104) involves $u_{i+1,j+1}$ and $u_{i-1,j+1}$ as well as $u_{i,j+1}$, it is said
to be an implicit difference equation. Given $u_{i,j}$ for i = 1, 2, ..., M − 1
as well as $u_{0,j}$, $u_{M,j}$, $u_{0,j+1}$, and $u_{M,j+1}$, in order to find $u_{i,j+1}$ for i = 1, 2,
..., M − 1 we must solve a system of (M − 1) linear algebraic
equations with (M − 1) unknowns. One way of doing this would be
to use the successive-overrelaxation method. It is not difficult to show
that $\bar\mu < r/(1 + r)$, from which the optimum value of ω can be found by
the following formula:

$$\omega_b = 1 + \left(\frac{\bar\mu}{1 + \sqrt{1 - \bar\mu^2}}\right)^2.$$
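As a small numerical illustration of the formula for the optimum relaxation factor, the sketch below evaluates it with the bound r/(1 + r) used in place of μ̄ itself. That substitution, and the function names, are assumptions made for the example only.

```python
import math


def optimum_omega(mu_bar):
    """Optimum relaxation factor 1 + (mu / (1 + sqrt(1 - mu^2)))^2 for 0 <= mu < 1."""
    return 1 + (mu_bar / (1 + math.sqrt(1 - mu_bar ** 2))) ** 2


def crank_nicolson_omega(r):
    """Estimate of the optimum factor, using the bound r/(1 + r) for mu_bar."""
    return optimum_omega(r / (1 + r))


for r in (0.5, 1.0, 4.0, 16.0):
    print(r, crank_nicolson_omega(r))
```

The value moves toward 2 as r grows, which reflects the slower convergence for large r.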
Such a procedure was used in [60]. Evidently the rate of convergence
decreases as r increases. If k is chosen proportional to h as h decreases,
then the total computational effort varies as $h^{-5/2}$ with the above procedure,
as compared with $h^{-3}$ for the forward difference method.

We remark that greater accuracy can be obtained by using 


$$\frac{rh^2}{2}\left(\phi_{i,j} + \phi_{i,j+1}\right)$$


instead of $rh^2\phi_{i,j}$ in (11.104). Experience has indicated (see [60]) that
the number of iterations necessary with the successive-overrelaxation 
method changes very little when this modification is introduced. 

Since the matrix of the linear system to be solved at each time step is 
tridiagonal, one can use the method described in Sec. 11.6 [see (11.43) and 
(11.44)] to solve the linear system involved in passing from a given time 
step to the next. As mentioned earlier, the first use of this algorithm to 
solve parabolic equations is described in [7]. The formulas as applied 
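to (11.104) are as follows:


$$D_{i,j} = \frac{r}{2(1+r)}\left(u_{i-1,j} + u_{i+1,j}\right) + \frac{1-r}{1+r}\,u_{i,j} + \frac{rh^2}{1+r}\,\phi_{i,j}, \quad i = 1, 2, \ldots, M-1,$$

$$b_{0,j} = 0, \quad b_{i,j} = -\frac{r}{2(1+r)} \Big/ \left[1 + \frac{r}{2(1+r)}\,b_{i-1,j}\right], \quad i = 1, 2, \ldots, M-2,$$

$$q_{1,j} = D_{1,j}, \quad q_{i,j} = \left[D_{i,j} + \frac{r}{2(1+r)}\,q_{i-1,j}\right] \Big/ \left[1 + \frac{r}{2(1+r)}\,b_{i-1,j}\right], \quad i = 2, 3, \ldots, M-1, \tag{11.105}$$

$$u_{M-1,j+1} = q_{M-1,j}, \quad u_{i,j+1} = q_{i,j} - b_{i,j}\,u_{i+1,j+1}, \quad i = M-2, M-3, \ldots, 1.$$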


Let us consider now the example given in Sec. 11.15 with r = 1. 
Proceeding from j = 0 to j = 1, we obtain the following:


 i           0       1        2        3      4

 u_{i,0}     0     1.000    1.000    1.000    0
 D_{i,0}            .250     .500     .250
 b_{i,0}     0     -.250    -.266
 q_{i,0}            .250     .600     .428
 u_{i,1}     0      .428     .714     .428    0
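The worked step above can be checked with a short program. The following is a minimal sketch of one Crank-Nicolson step solved by the forward elimination and back substitution just described, assuming φ = 0 and zero boundary values at both ends; the function name `crank_nicolson_step` is hypothetical.

```python
def crank_nicolson_step(u, r):
    """One Crank-Nicolson step for zero boundary values and phi = 0.

    u holds u_{0,j}, ..., u_{M,j}; the result holds u_{0,j+1}, ..., u_{M,j+1}.
    """
    m = len(u) - 1                       # interior points are i = 1, ..., m-1
    c = r / (2 * (1 + r))                # off-diagonal coefficient after scaling
    d = [0.0] * (m + 1)
    for i in range(1, m):
        d[i] = c * (u[i - 1] + u[i + 1]) + (1 - r) / (1 + r) * u[i]

    # Forward elimination: b_0 = q_0 = 0 reproduces q_1 = D_1.
    b = [0.0] * (m + 1)
    q = [0.0] * (m + 1)
    for i in range(1, m):
        denom = 1 + c * b[i - 1]
        b[i] = -c / denom
        q[i] = (d[i] + c * q[i - 1]) / denom

    # Back substitution, starting from the zero boundary value at i = m.
    new = [0.0] * (m + 1)
    for i in range(m - 1, 0, -1):
        new[i] = q[i] - b[i] * new[i + 1]
    return new


u1 = crank_nicolson_step([0.0, 1.0, 1.0, 1.0, 0.0], r=1.0)
print([round(x, 4) for x in u1])
```

For this example the interior values are exactly 3/7, 5/7, 3/7, which agree with the tabulated .428, .714, .428 to the three places carried there.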
We see that the amount of work necessary per time step is approximately
four times as great as for the forward difference method. However, the
size of time step k can be made much larger. If we assume that k varies
as h, as h tends to zero, then the work involved increases as $h^{-2}$ with the
Crank-Nicolson method, compared with $h^{-3}$ with the forward difference
method. Of course the constants of proportionality are different. We
have already seen how a factor of approximately 4 is introduced by using
the algorithm for solving the special linear system. Also, if φ occurring
in the differential equation is nonlinear, one may need to apply the
formulas (11.105) two or more times. After the first application of
(11.105) we would normally replace $\phi_{i,j}$ by $\frac{1}{2}(\phi_{i,j} + \phi_{i,j+1})$ and reapply
(11.105). Generally speaking, one such iteration should be sufficient
to obtain an accuracy consistent with that which is inherent in the basic
difference equation. Therefore, even for nonlinear equations there is
still an important order-of-magnitude advantage of the Crank-Nicolson
method over the forward difference method.

The Crank-Nicolson method is stable for all values of r and, in fact, is
stable for any pair of positive values h and k. Moreover, for cases where
φ = 0 and where $g_1 = g_2 = 0$, Juncosa and Young [26] have proved that
the method is convergent, provided f(x) is piecewise continuous and
provided k = O(h/|log h|) as h tends to zero. This of course implies
convergence for all r. We remark that an error in logic which is frequently
made when discussing such matters is to assume that, since there are no
restrictions on r, then there are no restrictions on k. We also remark
that, although convergence has been proved only for k = O(h/|log h|),
it appears likely that it holds also for k = O(h). This assumption has
been made in the previous comparisons of the Crank-Nicolson and the
forward difference methods.


11.17 Problems Involving Two Space Dimensions


Let us consider the following problem: given a bounded plane region
R with boundary S and given functions f(x,y) defined in R and g(x,y,t)
defined on S′, find a function U(x,y,t) which is continuous in R′ + S′ and
satisfies


$$U_{xx} + U_{yy} = U_t \quad \text{in } R',$$
$$U(x,y,0) = f(x,y), \quad (x,y) \in R, \tag{11.106}$$
$$U(x,y,t) = g(x,y,t), \quad (x,y,t) \in S'.$$


Here R′ and S′ are the sets of all points (x,y,t) such that t > 0 and (x,y) ∈ R
and such that t > 0 and (x,y) ∈ S, respectively. It is assumed that U is
twice differentiable with respect to x and y and once differentiable with
respect to t in R′.




For simplicity, we restrict our attention to the case where R + S is a
rectangle with sides a = Mh, b = Nh (M, N integers) and h = Δx = Δy
is the common space increment. It should be noted, however, that much
of what follows applies to considerably more general regions.

The simplest method again is the forward difference method. The 
formulas and the method are direct extensions of those discussed in Sec. 
11.15. The formula is 


$$u(x,y,t+\Delta t) = u(x,y,t) + r\,[u(x+\Delta x,y,t) + u(x-\Delta x,y,t) + u(x,y+\Delta y,t) + u(x,y-\Delta y,t) - 4u(x,y,t)]. \tag{11.107}$$


We remark that the condition on k = At for stability and convergence is 
now 


$$r = \frac{k}{h^2} \le \frac{1}{4}. \tag{11.108}$$

As before, although the computational scheme is explicit and straight- 
forward, the requirement of (11.108) imposes such a severe limitation on 
the size of k that the method is of doubtful practicality. 

It is perhaps interesting to note that there is a very close connection 
between the method just described and Richardson’s method of iteration 
[33] for solving Laplace’s equation, of which the Jacobi method of simul- 
taneous displacements is, in this instance, a special case. Thus, if we 
replace u(x,y,t) by $u^{(n)}(x,y)$ and u(x,y,t + k) by $u^{(n+1)}(x,y)$ in (11.107), we
get

$$u^{(n+1)}(x,y) = u^{(n)}(x,y) + r\,[u^{(n)}(x+h,y) + u^{(n)}(x-h,y) + u^{(n)}(x,y+h) + u^{(n)}(x,y-h) - 4u^{(n)}(x,y)], \tag{11.109}$$


which for r = 1/4 reduces to the Jacobi method.
For the rectangle the eigenvalues associated with this method are


$$\mu_{p,q} = 1 - 4r\left(\sin^2\frac{p\pi h}{2a} + \sin^2\frac{q\pi h}{2b}\right)$$

$$(p = 1, 2, \ldots, M-1; \quad q = 1, 2, \ldots, N-1). \tag{11.110}$$

These correspond to the eigenvectors


$$v_{p,q}(x,y) = \sin\frac{p\pi x}{a}\,\sin\frac{q\pi y}{b}, \tag{11.111}$$


which vanish on the boundary of the rectangle.
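The eigenvalue relation can be verified numerically on a small grid. The sketch below applies one step of (11.109) to a product of sines and compares the result with the multiple predicted by (11.110); the grid sizes, the mode numbers, and the function name `richardson_step` are arbitrary choices made for the illustration.

```python
import math


def richardson_step(u, r):
    """Apply (11.109) once at the interior points; boundary entries are left unchanged."""
    M, N = len(u) - 1, len(u[0]) - 1
    new = [row[:] for row in u]
    for i in range(1, M):
        for j in range(1, N):
            new[i][j] = u[i][j] + r * (u[i + 1][j] + u[i - 1][j]
                                       + u[i][j + 1] + u[i][j - 1] - 4 * u[i][j])
    return new


M, N, p, q, r = 6, 4, 2, 1, 0.25
# Eigenvector sin(p*pi*x/a) sin(q*pi*y/b) sampled at x = i*h, y = j*h, with a = M*h, b = N*h.
v = [[math.sin(p * math.pi * i / M) * math.sin(q * math.pi * j / N)
      for j in range(N + 1)] for i in range(M + 1)]
w = richardson_step(v, r)

# Predicted eigenvalue; note p*pi*h/(2a) = p*pi/(2M).
mu = 1 - 4 * r * (math.sin(p * math.pi / (2 * M)) ** 2
                  + math.sin(q * math.pi / (2 * N)) ** 2)
max_dev = max(abs(w[i][j] - mu * v[i][j])
              for i in range(1, M) for j in range(1, N))
print(mu, max_dev)
```

The deviation is at the level of rounding error, as expected if the sampled sine product is an eigenvector of the iteration.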

Roughly speaking, the number of iterations needed for convergence for
the elliptic equation is closely related to the number of time increments
needed to achieve “steady state” in the associated transient heat-flow
problem. Since the larger the t variable, the closer one is to a steady-state
condition (assuming that the boundary values for u are constant),
one would like to take as large a value of k as possible. However, by
(11.110), we are limited by stability considerations for the transient
problem. If we wish to use a fixed value of r, we can take r only very
slightly greater than 1/4. Actually the r for which the maximum of $|\mu_{p,q}|$
over all p,q is minimized is r = 1/4.

On the other hand, by taking r considerably larger than 1/4, we may
reduce certain components of the error [corresponding to some of the
pairs (p,q)], but at the same time we may increase others. By choosing a
sequence of values of r based on Chebyshev polynomials and by using
these values repeatedly in a cyclic order (as described, for instance, in
[56]) the components of the error can all be reduced by more or less
uniform amounts. There are two major practical difficulties associated
with the method. First, it was found in [56, 54, 61] that, even when care
was exercised in the choice of the order in which the values of r were used,
the control of rounding errors was a serious problem. It appears that
this difficulty can be at least partially overcome by the use of a formula
which is analogous to (11.109) but which involves $u^{(n-1)}$ as well as $u^{(n)}$ and
$u^{(n+1)}$ (see, e.g., [40] and [46], Appendix C). This formula is based on
the use of a three-term recurrence relation for Chebyshev polynomials.
The use of the new formula also appears preferable because after m
iterations, for any m, the result which one obtains is equivalent to that which
one would have obtained by applying (11.109) m times using the best set
of m values of r. If one uses (11.109) itself, however, with a fixed number,
say s, of different values of r, the only intermediate results which would in
general have significance would be those obtained after each set of s
iterations.

The second difficulty, or limitation, associated with Richardson’s 
method essentially rests on the fact that, in significantly reducing some 
error components, others may be substantially increased. It would be 
very desirable to have a method by means of which one could reduce any 
error component without increasing others. Both the methods of Peace- 
man and Rachford [32] and of Douglas and Rachford [12], the first of 
which is described later in this section, have this property, at least in 
certain special cases. 

The Crank-Nicolson method for problems involving two or more space
variables is likewise a direct generalization of the method described in
Sec. 11.16 for one space variable. In order to proceed from one time
step to another, one must solve the difference analogue of a Dirichlet
problem in two dimensions. Unfortunately there does not appear to be
any simple scheme for doing this, as there was in the case of one space
dimension. In a manner similar to that described in Sec. 11.16, a good
estimate of the optimum relaxation factor for the iterative process is
very easy to make. Indeed, a good value of $\bar\mu$ to be used in the formula
for the optimum relaxation factor would be $\bar\mu = 2r/(1 + 2r)$. There
are no limitations on k and h as far as stability is concerned and, as in the
one-dimensional case, it appears that convergence could be proved for
k = O(h); consequently the work required varies as $h^{-7/2}$, compared with
$h^{-4}$ with the forward difference method.

There does not appear to be any useful interpretation of the Crank- 
Nicolson method to iterative methods for solving elliptic difference equa- 
tions. The reason for this is that no techniques other than any of the 
already-known iterative methods have been used to carry out the process. 

Two methods for solving time-dependent problems have led to new 
techniques for solving elliptic equations. We restrict our attention here 
to the method of Douglas, Peaceman, and Rachford [11] and [32], even 
though the method of Douglas and Rachford is somewhat more general, 
in that it can be extended to apply to three dimensions. The 
former method appears to be somewhat more efficient in certain special 
cases. 

The basic process of passing from t to t + k is performed in two parts.
Each part involves an implicit formula for which, however, the associated
matrix is tridiagonal. In the first part one has to solve an implicit system
for the rows, whereas in the second part one has an implicit system for the
columns. The actual formulas are the following:


$$u\left(x,y,t+\tfrac{k}{2}\right) = u(x,y,t) + r\left[u\left(x+h,y,t+\tfrac{k}{2}\right) + u\left(x-h,y,t+\tfrac{k}{2}\right) - 2u\left(x,y,t+\tfrac{k}{2}\right) + u(x,y+h,t) + u(x,y-h,t) - 2u(x,y,t)\right],$$

$$u(x,y,t+k) = u\left(x,y,t+\tfrac{k}{2}\right) + r\left[u\left(x+h,y,t+\tfrac{k}{2}\right) + u\left(x-h,y,t+\tfrac{k}{2}\right) - 2u\left(x,y,t+\tfrac{k}{2}\right) + u(x,y+h,t+k) + u(x,y-h,t+k) - 2u(x,y,t+k)\right]. \tag{11.112}$$

Here we let

$$r = k/2h^2$$

instead of the usual value $k/h^2$.

It is shown in [11] that the local approximation of the difference equation
(11.112) to the differential equation $u_t = u_{xx} + u_{yy}$ is of the same
order in h and k as is the Crank-Nicolson difference equation. On the
other hand, since each step of the present procedure can be carried out
explicitly, using the formulas given in Sec. 11.6, the amount of work
required is much less than for the Crank-Nicolson method. It is further
shown in [11] that, for the case of the rectangle, the method is stable for
any pair of values h and k. It is interesting to note that the repeated use
of either part of (11.112) alone would yield a process which could be
unstable for some values of h and k.

As shown in [32], the fact that one is able to take large time steps with
relatively little effort suggests that one could develop an efficient iterative
procedure for solving the difference analogue of Laplace’s equation. As
a matter of fact, since formulas analogous to (11.112) could be derived
corresponding to the differential equation $u_t = L(u)$, where L(u) is a
more general elliptic operator, such a method could be, formally at least,
applicable to a much wider class of elliptic equations. If we replace
u(x,y,t) by $u^{(n)}(x,y)$, u(x,y,t + k/2) by $u^{(n+1/2)}(x,y)$, and u(x,y,t + k) by
$u^{(n+1)}(x,y)$, we get formula (11.46). The method so obtained is frequently
referred to as the Peaceman-Rachford method and also as the alternating-
direction implicit method. The former terminology seems preferable, since
both the Peaceman-Rachford method and the Douglas-Rachford method
are alternating-direction implicit methods.

Following an analysis similar to that given for the derivation of
(11.112), we have for a single row iteration


$$\mu'_{p,q} = \frac{1 - 4r\sin^2(q\pi h/2)}{1 + 4r\sin^2(p\pi h/2)}$$


and for a single column iteration


$$\mu''_{p,q} = \frac{1 - 4r\sin^2(p\pi h/2)}{1 + 4r\sin^2(q\pi h/2)}.$$


For a double iteration we have


$$\mu_{p,q} = \frac{1 - 4r\sin^2(q\pi h/2)}{1 + 4r\sin^2(q\pi h/2)} \cdot \frac{1 - 4r\sin^2(p\pi h/2)}{1 + 4r\sin^2(p\pi h/2)}.$$
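The double-iteration factor can be checked against a direct implementation of the two half steps. What follows is a minimal sketch for the unit square with zero boundary values, assuming the row sweep is implicit in x and the column sweep implicit in y as in (11.112); the helper names `solve_tridiagonal` and `peaceman_rachford_step` are hypothetical.

```python
import math


def solve_tridiagonal(r, rhs):
    """Solve -r*x[i-1] + (1+2r)*x[i] - r*x[i+1] = rhs[i], with x = 0 beyond both ends."""
    n = len(rhs)
    b = [0.0] * n
    q = [0.0] * n
    b[0] = -r / (1 + 2 * r)
    q[0] = rhs[0] / (1 + 2 * r)
    for i in range(1, n):
        denom = (1 + 2 * r) + r * b[i - 1]
        b[i] = -r / denom
        q[i] = (rhs[i] + r * q[i - 1]) / denom
    x = [0.0] * n
    x[-1] = q[-1]
    for i in range(n - 2, -1, -1):
        x[i] = q[i] - b[i] * x[i + 1]
    return x


def peaceman_rachford_step(u, r):
    """One double sweep on an (M+1) x (M+1) grid with zero boundary values."""
    M = len(u) - 1
    half = [[0.0] * (M + 1) for _ in range(M + 1)]
    for j in range(1, M):                # implicit in x, explicit in y
        rhs = [u[i][j] + r * (u[i][j + 1] + u[i][j - 1] - 2 * u[i][j])
               for i in range(1, M)]
        x = solve_tridiagonal(r, rhs)
        for i in range(1, M):
            half[i][j] = x[i - 1]
    new = [[0.0] * (M + 1) for _ in range(M + 1)]
    for i in range(1, M):                # explicit in x, implicit in y
        rhs = [half[i][j] + r * (half[i + 1][j] + half[i - 1][j] - 2 * half[i][j])
               for j in range(1, M)]
        y = solve_tridiagonal(r, rhs)
        for j in range(1, M):
            new[i][j] = y[j - 1]
    return new


M, p, q, r = 8, 2, 3, 1.5
h = 1.0 / M
v = [[math.sin(p * math.pi * i * h) * math.sin(q * math.pi * j * h)
      for j in range(M + 1)] for i in range(M + 1)]
w = peaceman_rachford_step(v, r)
sp = math.sin(p * math.pi * h / 2) ** 2
sq = math.sin(q * math.pi * h / 2) ** 2
factor = (1 - 4 * r * sq) / (1 + 4 * r * sq) * (1 - 4 * r * sp) / (1 + 4 * r * sp)
max_dev = max(abs(w[i][j] - factor * v[i][j])
              for i in range(1, M) for j in range(1, M))
print(factor, max_dev)
```

Each half step needs only tridiagonal solves, which is the source of the low cost per time step noted in the text.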


We note that in contrast to the situation for the forward difference
method, all of the $\mu_{p,q}$ are less than one in absolute value for all r. We
note that even here, where we can safely take values of r as large as we
please, we nevertheless cannot achieve arbitrarily rapid convergence
merely by choosing r sufficiently large. As described in Secs. 11.7 and
11.12, the convergence is accelerated by choosing a set of s different
values of r so as approximately to minimize the expression


$$\max_{1 \le p,q \le M-1}\ \prod_{k=1}^{s} |\mu_{p,q}(r_k)|.$$
11.18 Estimation of Convergence Rates for Elliptic Problems


We have already seen how methods for solving parabolic partial
differential equations often lead to iterative methods for solving
difference-equation analogues of elliptic equations. The methods considered in the
preceding section led to methods of simultaneous displacements, in which
new values obtained during an iteration are not used until after a
complete iteration. Such simultaneous-displacement methods can be
associated with parabolic equations. On the other hand, certain methods
of successive displacements, such as the successive-overrelaxation method,
can be associated with hyperbolic equations. Furthermore, an analysis
of such equations can often be used to provide information concerning the
rate of convergence of the iterative method.

For example, let us consider a one-dimensional finite-difference
analogue of the Dirichlet problem


$$u_i = \tfrac{1}{2}(u_{i-1} + u_{i+1}), \tag{11.113}$$


where $u_0$ and $u_M$ are given. Of course, this difference equation could
very easily be solved analytically. However, the analysis of the
successive-overrelaxation method as applied to this problem is similar to
that for more complicated problems.

The iterative formula for the successive-overrelaxation method is


$$u_i^{(n+1)} = \frac{\omega}{2}\left(u_{i-1}^{(n+1)} + u_{i+1}^{(n)}\right) - (\omega - 1)\,u_i^{(n)}. \tag{11.114}$$

We seek to obtain by a method which is a slight modification of that
used by Garabedian [19] a related hyperbolic equation, the analysis of
which will lead to an approximate formula for the optimum relaxation
factor.

We let $u_i^{(n)} = u(x,t)$, $u_i^{(n+1)} = u(x,t+k)$, and $u_{i+1}^{(n)} = u(x+h,t)$.
Equation (11.114) becomes


$$u(x,t+k) = \frac{\omega}{2}\,[u(x+h,t) + u(x-h,t+k)] - (\omega - 1)\,u(x,t). \tag{11.115}$$

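Expanding by Taylor’s theorem about (x,t), we get


$$u + ku_t + \frac{k^2}{2}u_{tt} + \cdots = \frac{\omega}{2}\left[u + hu_x + \frac{h^2}{2}u_{xx} + \cdots + u - hu_x + ku_t + \frac{h^2}{2}u_{xx} - hk\,u_{xt} + \frac{k^2}{2}u_{tt} + \cdots\right] - (\omega - 1)\,u.$$
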
Neglecting terms involving $h^3$, $h^2k$, $hk^2$, $k^3$, and higher-order terms, we
have*


$$\left(ku_t + \frac{k^2}{2}u_{tt}\right)\left(1 - \frac{\omega}{2}\right) = \frac{\omega}{2}\left(h^2u_{xx} - hk\,u_{xt}\right).$$


Dividing both sides by $hk\omega/2$ and letting $h/k = \alpha$, we obtain

$$C\left(u_t + \frac{k}{2}u_{tt}\right) = \alpha u_{xx} - u_{xt}, \tag{11.116}$$

where

$$C = \frac{2 - \omega}{\omega h}. \tag{11.117}$$

Introducing the new variable $s = t + x/(2\alpha)$ and letting $U(x,s) = u[x, s - x/(2\alpha)]$, we get

$$C\left(U_s + \frac{k}{2}U_{ss}\right) = \alpha U_{xx} - \frac{1}{4\alpha}U_{ss}. \tag{11.118}$$

Now, using the method of separation of variables, we let U(x,s) =
X(x)S(s), obtaining, upon substitution in (11.118),

$$\alpha\,\frac{X''(x)}{X(x)} = \frac{(1/4\alpha + Ck/2)\,S''(s) + C\,S'(s)}{S(s)}. \tag{11.119}$$

In the study of the convergence rate of the method defined by (11.114),
it is sufficient to consider the difference $u_i^{(n)} - u_i$. Hence we may
assume that $u_0^{(n)} = u_M^{(n)} = 0$ for all n or, equivalently, that u(0,t) =
u(1,t) = 0 for all t. This implies that X(0) = X(1) = 0. Since both
sides of (11.119) must be constant, we have X(x) = sin pπx, p = 1, 2, ....
Moreover, we may let $S(s) = e^{-\beta s}$, where β satisfies


$$\left(\frac{1}{4\alpha} + \frac{Ck}{2}\right)\beta^2 - C\beta + \alpha p^2\pi^2 = 0$$

or, since $\alpha k = h$,

$$\frac{1 + 2Ch}{4}\left(\frac{\beta}{\alpha}\right)^2 - C\left(\frac{\beta}{\alpha}\right) + p^2\pi^2 = 0. \tag{11.120}$$


Solving for (β/α), we obtain

$$\frac{\beta}{\alpha} = \frac{C \pm \sqrt{C^2 - p^2\pi^2(1 + 2Ch)}}{\tfrac{1}{2}(1 + 2Ch)}. \tag{11.121}$$

We assert that the rate of convergence for the method defined by
(11.114) is determined by the smallest value which the real part of (β/α)
can assume for all positive integral values of p. For on a single iteration
the error is reduced by a factor $e^{-\beta k} = e^{-(\beta/\alpha)h}$. Thus, for given h, the
larger this minimum value of (β/α), the faster will be the convergence
rate. We seek to show that, to a first approximation, the optimum
choice of C is π and the corresponding minimum value is 2π.


* Garabedian [19] neglected the terms involving $k^2$. The result obtained here
agrees with Garabedian’s result to a first-order approximation.

Let $C_1$ be the positive root of the quadratic equation


$$C^2 - \pi^2(1 + 2hC) = 0, \tag{11.122}$$

which is given by

$$C_1 = h\pi^2 + \sqrt{h^2\pi^4 + \pi^2} = \pi + O(h). \tag{11.123}$$

If $C = C_1$, then for p = 1, we have


$$\mathrm{Re}\left(\frac{\beta}{\alpha}\right) = \frac{C_1}{\tfrac{1}{2}(1 + 2hC_1)} = \frac{2C_1}{1 + 2hC_1}.$$


Since $(d/dC)[C^2 - p^2\pi^2(1 + 2hC)] = 2(C - p^2\pi^2h)$, it follows that this
derivative is positive if C ≥ π, p = 1, and if h < 1/π. For $C > C_1 > \pi$,
it follows that $C^2 - \pi^2(1 + 2hC) > 0$. The roots of (11.120) for p = 1
are real, and the smaller is given by


$$\frac{\beta}{\alpha} = \frac{2\pi^2}{C + \sqrt{C^2 - \pi^2(1 + 2Ch)}} < 2\pi + O(h).$$


On the other hand, since the smaller root of (11.122) is negative, and
since for very large values of C the function


$$g(C) = C^2 - \pi^2(1 + 2hC)$$


is positive, it follows that g(C) < 0 for $0 \le C < C_1$. This being the case,
the roots of (11.121) for $0 \le C < C_1$ and for p = 1 are complex, and
their real parts are given by


$$\mathrm{Re}\left(\frac{\beta}{\alpha}\right) = \frac{C}{\tfrac{1}{2}(1 + 2Ch)} < 2C_1 = 2\pi + O(h).$$

Thus, to a first approximation, one cannot do any better than to
choose C = π. From (11.117), the corresponding value of ω is


$$\omega = \frac{2}{1 + \pi h}. \tag{11.124}$$

As shown in Sec. 11.9, this agrees with the result obtained by the use of
overrelaxation theory.
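The practical effect of this choice of ω can be seen by counting iterations on the one-dimensional model problem. The sketch below is illustrative only: it takes the exact solution to be zero (so the iterate is the error), starts from an arbitrary error of 1 at every interior point, and uses a hypothetical function name `sor_iterations`.

```python
import math


def sor_iterations(M, omega, tol=1e-8, max_iter=100000):
    """Count sweeps of (11.114) needed to bring the largest error below tol.

    With u_0 = u_M = 0 the exact solution is zero, so u itself is the error.
    """
    u = [0.0] + [1.0] * (M - 1) + [0.0]
    for n in range(1, max_iter + 1):
        for i in range(1, M):
            # u[i-1] already holds the new value, u[i+1] still the old one.
            u[i] = 0.5 * omega * (u[i - 1] + u[i + 1]) - (omega - 1) * u[i]
        if max(abs(x) for x in u) < tol:
            return n
    return max_iter


M = 20
h = 1.0 / M
plain = sor_iterations(M, 1.0)                          # omega = 1
approx = sor_iterations(M, 2 / (1 + math.pi * h))       # omega from (11.124)
exact = sor_iterations(M, 2 / (1 + math.sin(math.pi * h)))
print(plain, approx, exact)
```

The third value uses 2/(1 + sin πh), the standard optimum for this model problem, for comparison; it differs from (11.124) only in terms of higher order in h.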

The preceding result can, of course, be derived either as a special case
of the general theory of Sec. 11.9 or by direct use of the difference equation
(11.115), using the method of separation of variables. This is true
of two-dimensional as well as one-dimensional problems.

It is interesting to note that the new variable s = t + αx could be
introduced directly into (11.115), with the result


$$U(x,s+k) = \frac{\omega}{2}\,[U(x+h, s+\alpha h) + U(x-h, s-\alpha h+k)] - (\omega - 1)\,U(x,s). \tag{11.125}$$

Letting α = k/2h, we obtain


$$U(x,s+k) = \frac{\omega}{2}\left[U\left(x+h, s+\frac{k}{2}\right) + U\left(x-h, s+\frac{k}{2}\right)\right] - (\omega - 1)\,U(x,s). \tag{11.126}$$

We can then let $U(x,s) = \sin(p\pi x)\,e^{-\beta s}$, obtaining

$$e^{-\beta k} = \omega\cos(p\pi h)\,e^{-\beta k/2} - (\omega - 1). \tag{11.127}$$


Since $U(x,s) = \sin(p\pi x)\,e^{-\beta t - \alpha\beta x}$, it follows that in one time step
(iteration) the error is reduced by a factor $\lambda = e^{-\beta k}$. Substituting in (11.127),
we get


$$\lambda + \omega - 1 = \omega\mu\lambda^{1/2}, \tag{11.128}$$


where μ = cos pπh is an eigenvalue for the Jacobi method. Since
(11.128) is the basic relation between the eigenvalues of the successive-
overrelaxation method and of the Jacobi method, the rest of the analysis
is as given in Sec. 11.9.

Garabedian [19] applied his analysis to the 9-point difference-
equation analogue of Laplace’s equation:


$$u(x,y) = \frac{1}{20}\{4[u(x+h,y) + u(x-h,y) + u(x,y+h) + u(x,y-h)] + u(x+h,y+h) + u(x-h,y+h) + u(x+h,y-h) + u(x-h,y-h)\}.$$

In this case, although for rectangular regions the method of separation of
variables can be used, the formulas for the eigenvalues which determine
the convergence rate of the successive-overrelaxation method are very
complicated and do not appear to have been successfully used.


By Garabedian’s method, the optimum value of ω for a problem
involving the unit square with mesh size h is given by


6) 


as compared with 


for the 5-point formula. 
Since the factor of reduction of the error for each time step (iteration) is

$$e^{-\beta k} = e^{-rp^2\pi^2h^2},$$

then the smallest reduction, corresponding to p = 1, is by the factor

$$e^{-r\pi^2h^2}.$$

According to this, one could obtain arbitrarily rapid convergence by
choosing r sufficiently large.

On the other hand, we know that, with r substantially larger than 1/2,
the method defined by (11.129) will not even converge. In order to
study this situation further, let us retain all terms in the Taylor
expansions. By (11.129), we have

$$\sum_{m=1}^{\infty} \frac{k^m}{m!}\,\frac{\partial^m u}{\partial t^m} = 2r\sum_{m=1}^{\infty} \frac{h^{2m}}{(2m)!}\,\frac{\partial^{2m} u}{\partial x^{2m}}.$$


Letting $u = e^{-\beta t}\sin p\pi x$, we get


$$\sum_{m=1}^{\infty} \frac{(-\beta k)^m}{m!} = 2r\sum_{m=1}^{\infty} (-1)^m\,\frac{(p\pi h)^{2m}}{(2m)!},$$

or

$$e^{-\beta k} - 1 = 2r(\cos p\pi h - 1) = -4r\sin^2\frac{p\pi h}{2},$$

$$e^{-\beta k} = 1 - 4r\sin^2\frac{p\pi h}{2}.$$


We remark that one could obtain this same result by substituting
$u = e^{-\beta t}\sin p\pi x$ directly into (11.129).
Evidently, if r is substantially greater than 1/2, the quantity
$1 - 4r\sin^2(p\pi h/2)$ is greater than one in modulus for some p, and the method
may not converge. It is clear that the procedure based on the use of
(11.131) is not applicable.
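The contrast between the exact factor and the estimate obtained from the truncated expansion can be tabulated directly. This is a rough sketch: the first-order estimate exp(−r p² π² h²) is an assumption based on keeping only the leading terms of the expansions, and both function names are hypothetical.

```python
import math


def amplification(r, p, h):
    """Exact per-step factor 1 - 4 r sin^2(p pi h / 2) of the forward difference method."""
    return 1 - 4 * r * math.sin(p * math.pi * h / 2) ** 2


def truncated_estimate(r, p, h):
    """First-order estimate exp(-r p^2 pi^2 h^2) from the truncated expansion."""
    return math.exp(-r * (p * math.pi * h) ** 2)


h = 0.1                                   # M = 10, so p runs over 1, ..., 9
worst = {}
for r in (0.25, 0.5, 2.0):
    worst[r] = max(abs(amplification(r, p, h)) for p in range(1, 10))
    print(r, worst[r], truncated_estimate(r, 1, h))
```

The truncated estimate keeps decreasing as r grows, while the largest exact factor exceeds one in modulus once r is well above 1/2.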

We note that the condition r ≤ 1/2 is just the condition which must be
satisfied in order for (11.129), considered as a numerical procedure for
solving the diffusion equation $u_t = u_{xx}$, to be stable. It is probably true
that, before attempting to use the method of Garabedian, one should
first determine limits on the iteration parameters based on stability
considerations.
12.1 Some Examples of Integral Equations; Classification


The study of integral equations originated in the nineteenth century. 
Their practical application and numerical methods for solving the 
equations are much younger; most of this work has been done in the 
last decades, and more research is still to come. In modern perspective, 
integral equations appear as a special case of the problems in functional 
analysis. This is not to the disadvantage of the numerical analyst. 
The methods of functional analysis facilitate both the analysis and design 
of numerical procedures. For instance, the analogue of the familiar 
Newton process is important for the solution of nonlinear integral 
equations. (See Chap. 14 and Kantorovitch [61].) 

Here we are concerned with linear integral equations. The following 
examples are to serve the purpose of a first orientation. 


(a) Cascade Flow 


We consider the plane flow of an incompressible fluid through a
cascade of congruent profiles. The contours of the profiles in the
complex plane of z = x + iy are denoted by $C_n$ (Fig. 12.1), where n
runs through all positive and negative integers. The contours are
simple; they do not intersect with one another, and they are so
arranged that a shift of $C_0$ by nli produces $C_n$. The complex points
ζ on the contour $C_0$ shall be related to the arc length s by means of a
function ζ = f(s), which we assume to have continuous derivatives up
to the second order. Any function w(z), holomorphic in the domain
outside the contours, with period li, bounded and continuously
approaching boundary values on the contours, can be represented by


$$w(z) = w_0 - \frac{1}{2li}\int_{C_0} \coth\left[\frac{\pi}{l}(\zeta - z)\right] w(\zeta)\,d\zeta, \quad z \text{ not on } C_n, \tag{12.1}$$


with $w_0$ as an arbitrary constant. Formula (12.1) is a consequence of
Cauchy’s integral formula applied to one contour in the image plane
Z = exp (2πz/l). We interpret w(z) = u − iv as the complex velocity
of a field of plane flow with u as the component in the x direction
and v as the component in the y direction.

[Fig. 12.1]

Relation (12.1) can be rewritten as


$$w(z) = w_0 - \frac{1}{2li}\int_{C_0} \coth\left[\frac{\pi}{l}(\zeta - z)\right][\gamma(s) - i\delta(s)]\,ds, \tag{12.2}$$


with $$\gamma(s) = \mathrm{Re}[f'(s)w(\zeta)], \quad \delta(s) = -\mathrm{Im}[f'(s)w(\zeta)]. \tag{12.3}$$


The quantity γ(s) is the velocity component tangent to $C_0$; δ(s) is the
velocity component normal to $C_0$. If the contours are to be streamlines,
δ(s) must vanish, and if z approaches a point τ = f(t) on $C_0$, the well-
known Plemelj formula yields


$$\tfrac{1}{2}\gamma(t) = f'(t)w_0 - \frac{1}{2li}\int \coth\left[\frac{\pi}{l}(\zeta - \tau)\right] f'(t)\,\gamma(s)\,ds. \tag{12.4}$$

This is a condition on the component γ(s), and the problem of finding a
flow field w(z) through the cascade involves solving (12.4) for γ. We
call (12.4) an integral equation for the unknown function γ; since the
integrand of (12.4) is a linear function of γ, we speak of a linear integral
equation. We still have to explain the concept of the integral involved.
In (12.1) the integral can be interpreted as a Riemann or as a Lebesgue
integral; both mean the same in the case (12.1), where the integrand is
a continuous function. In general, we adhere to the concept of the
Lebesgue integral. But (12.4) requires more explanation because of
the singularity of the integrand at ζ = τ. Let us exclude from the
integration that portion of $C_0$ for which |ζ − τ| < ε and let us denote
the integral so obtained by J(t,ε). Now we go to the limit ε → 0. If
the limit J(t) of J(t,ε) exists, we define J(t) as the integral in (12.4).
J(t) is known as the Cauchy principal value. This value exists if
γ(s) satisfies a Hölder condition on $C_0$. Quite generally, this condition
means


$$|F(P) - F(P')| \le A\,|P - P'|^{\mu}, \quad 0 < \mu \le 1, \quad P, P' \in S \tag{12.4'}$$


with regard to a function F(P) defined on a set S of the euclidean space
$E_n$, with |P − P′| standing for the distance of the points P, P′; here A
and μ are constants.


Let us now split (12.4) into real and imaginary parts. Setting


$$-\frac{1}{li}\,f'(t)\coth\left[\frac{\pi}{l}(\zeta - \tau)\right] = K(t,s) + iL(t,s), \tag{12.5}$$
$$2w_0 f'(t) = g(t) + ih(t), \tag{12.6}$$
we obtain
$$\gamma(t) = \int K(t,s)\,\gamma(s)\,ds + g(t), \tag{12.7}$$
$$0 = \int L(t,s)\,\gamma(s)\,ds + h(t). \tag{12.7'}$$


These are two linear integral equations for the unknown function γ(s).
It can be shown that they are equivalent to each other. This interesting
fact can be related to the regular flow fields which Eq. (12.2) defines
in the simply connected interiors of the contours $C_n$. Relation (12.7)
means that the flow field inside $C_0$ has a vanishing tangential component
along $C_0$; (12.7′) states that the inside field has a vanishing normal
component on $C_0$. A vanishing tangential component implies a
vanishing interior flow field, and so does a vanishing normal component.
Hence (12.7) and (12.7′) mean the same.

The so-called kernels K(t,s) and L(t,s) exhibit the following features:
K(t,s) is continuous; L(t,s) admits the split

$$L(t,s) = \frac{1}{\pi(s - t)} + H(t,s) \tag{12.8}$$

with a continuous function H. Because of the first term in (12.8), the
integral equation (12.7′) is called singular. We do not go into more
detail.

The preceding examples (12.7) and (12.7’) have been given for two 
reasons. They show an equivalence of certain “singular” and “regu- 
lar” integral equations. They also demonstrate that boundary-value 
problems of partial differential equations such as the calculation of 
flow fields can be reduced in dimension to problems involving the 
boundary only. The more difficult boundary-value problems of 
elasticity can be handled in a similar vein, as can be seen from
Muskhelishvili’s books [25, 26].
(b) Renewal Theory of Statistics


In an electronic device with a large number of tubes of the same type
but of different age, tube failures will occur, and failing tubes will be
replaced by new ones. Let f(t) be the relative rate of tubes failing at
age t. We want the total relative rate r(t) of failing tubes at time t > 0.
The function r(t) satisfies


$$r(t) = g(t) + \int_0^t f(t - s)\,r(s)\,ds. \tag{12.9}$$


This expresses the fact that renewals of tubes apply to both the original
ones (g) and to tubes installed at former times s > 0 (see Feller [14]).
Relation (12.9) is a linear integral equation for the unknown function
r(t). The interval of integration depends on t; this is the most significant
difference between (12.9) and (12.7). The t dependence of the
interval of integration is typical for certain initial-value problems.
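Because the upper limit of integration is t, an equation of the form (12.9) can be solved by marching forward in t. The following is a minimal sketch using the trapezoidal rule; the test case f(t) = g(t) = e^{-t}, for which r(t) = 1 satisfies (12.9) exactly, and the function name `solve_renewal` are assumptions chosen for illustration and are not taken from the text.

```python
import math


def solve_renewal(f, g, t_max, n):
    """Trapezoidal-rule solution of r(t) = g(t) + integral_0^t f(t - s) r(s) ds.

    Returns approximations to r at t = 0, h, 2h, ..., t_max with h = t_max / n.
    """
    h = t_max / n
    r = [g(0.0)]                          # at t = 0 the integral vanishes
    for i in range(1, n + 1):
        t = i * h
        acc = 0.5 * f(t) * r[0]           # end-point weight at s = 0
        for m in range(1, i):
            acc += f(t - m * h) * r[m]
        # The end-point term at s = t contains the unknown r(t); solve for it.
        r.append((g(t) + h * acc) / (1 - 0.5 * h * f(0.0)))
    return r


def failure_rate(t):
    return math.exp(-t)


rates = solve_renewal(failure_rate, failure_rate, t_max=2.0, n=200)
max_err = max(abs(x - 1.0) for x in rates)
print(max_err)
```

Each new value depends only on values already computed, which is the initial-value character mentioned in the text.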


Classification of Integral Equations 


If the unknown function appears under the integral sign only, we
speak of an integral equation of the first kind. Equation (12.7′) is of
this type. If the unknown function also stands outside the integral,
as in (12.7) and in (12.9), we say that the integral equation is of the
second kind. From now on we write the second kind in the form


$$y(s) = \lambda\int_a^b K(s,t)\,y(t)\,dt + f(s) \tag{12.10}$$


with s,t as real variables and a,b (a < b) as finite limits of the interval
of integration. The unknown function is y(s); K(s,t), f(s) are given;
λ is a parameter. All these quantities may assume complex values.
Furthermore, we confine f(s) to the class $L_2(a,b)$ and K(s,t) to the class
$L_2(D)$, where D stands for the square a ≤ s,t ≤ b. We look for solutions
y(s) of the class $L_2(a,b)$. Under these restrictions, Eq. (12.10) will
be referred to as the Fredholm integral equation. This covers the case
(12.7). If the kernel K(s,t) in (12.10) vanishes for t > s, we may write


$$y(s) = \lambda\int_a^s K(s,t)\,y(t)\,dt + f(s); \tag{12.10'}$$


this particular case is known as the Volterra integral equation. For
reasons which are given later, the parameter λ plays no essential role in
(12.10′); therefore this equation is usually written without the parameter
λ. The Volterra equation appears very often with a “difference”
kernel K(s,t) = F(s − t); Eq. (12.9) comes into this category.
We write integral equations of the first kind in the form


$$\int_a^b K(s,t)\,y(t)\,dt = f(s); \tag{12.11}$$


many important equations of this type go with a singular kernel of the
form


$$K(s,t) = \frac{c}{s - t} + H(s,t), \quad c = \text{const}; \tag{12.12}$$


the function H(s,t) is “less” singular than the first term in (12.12); very
often it is continuous in the open square a < s,t < b. The integral in
(12.11) is to be taken as the Cauchy principal value.

The main subject of this chapter is the numerical solution of 
Fredholm integral equations. However, a few remarks about singular 
integral equations of the first kind are given in the next section. 

In general, the numerical methods are explained without proof of 
their validity, for which the reader is referred to the literature. 


12.2 Integral Equations of the First Kind 


We consider the singular type (12.11), (12.12). Sometimes it is 
useful to look for an equivalent equation of the second kind and to 
attack it by numerical methods. An example is Eq. (12.7’) with its 
equivalent (12.7). As another example, we take 


$$ \frac{1}{\pi} \int_{-1}^{1} \left[ \frac{1}{s - t} + H(s,t) \right] y(t)\, dt = f(s) \tag{12.13} $$


under the restriction that f(s) and H(s,t) satisfy Holder conditions in 
their closed domains of definition. Introducing 


$$ g(s) = \frac{1}{\pi} \int_{-1}^{1} H(s,t)\, y(t)\, dt, \tag{12.14} $$

we rewrite (12.13) as follows:

$$ \frac{1}{\pi} \int_{-1}^{1} \frac{y(t)}{s - t}\, dt = f(s) - g(s). \tag{12.15} $$

Following Schmeidler [37], we write the solution to (12.15) in the form

$$ y(t) = \frac{\Gamma}{\pi}\, Q(t) - \frac{1}{\pi}\, Q(t) \int_{-1}^{1} \frac{f(x) - g(x)}{(t - x)\, Q(x)}\, dx, \qquad Q(t) = (1 - t^2)^{-1/2}, \tag{12.16} $$

where Γ is an arbitrary constant. Relation (12.16) can be given the
form

$$ y(t) = \int_{-1}^{1} L(t,s)\, y(s)\, ds + h(t), \tag{12.17} $$

with

$$ h(t) = \frac{\Gamma}{\pi}\, Q(t) - \frac{1}{\pi}\, Q(t) \int_{-1}^{1} \frac{f(x)}{(t - x)\, Q(x)}\, dx, \tag{12.18} $$

$$ L(t,s) = \frac{1}{\pi^2}\, Q(t) \int_{-1}^{1} \frac{H(x,s)}{(t - x)\, Q(x)}\, dx. \tag{12.19} $$

The further transformation

$$ y = zQ \tag{12.20} $$


results in a Fredholm integral equation for z(t). Some direct treatments
of singular integral equations of the first kind have been suggested
by various authors. Berg [3] deals with methods of iteration. Another
direct approach uses the Hilbert space H of the real- or complex-valued
functions of L_2(a,b) with the scalar product


$$ (f,g) = \int_a^b f(s)\, \overline{g(s)}\, ds, \qquad \bar{g} = \text{conjugate complex value of } g, \tag{12.21} $$

and defines the integral operations

$$ \int_a^b K(s,t)\, y(t)\, dt = Ky, \qquad \int_a^b \overline{K(t,s)}\, y(t)\, dt = K^* y \tag{12.22} $$

for sufficiently smooth functions y ∈ H; this definition is extended to the
other elements of H by a process of closure. The definition has to
satisfy certain conditions: Ky, K*y must be elements of H; K and K*
must be linear operations, that is, in the case of K,

$$ K(\alpha_1 y_1 + \alpha_2 y_2) = \alpha_1 K y_1 + \alpha_2 K y_2 \tag{12.23} $$

for any two constants α_1, α_2; and lastly

$$ (Kf, g) = (f, K^* g). \tag{12.24} $$


Schmeidler [37] proves that a necessary and sufficient condition for the
possibility of defining Ky, K*y this way is the existence of two complete
and orthonormal sets of functions φ_1, φ_2, ... and ψ_1, ψ_2, ... of H for
which Kφ_k and K*ψ_k can be defined as elements of H in accordance with
(12.24) and such that

$$ |(Kf,g)| \le M\, \|f\| \cdot \|g\|, \qquad \|f\| = (f,f)^{1/2}, \tag{12.25} $$

for any two linear combinations

$$ f = \sum_{k=1}^{n} a_k \varphi_k, \qquad g = \sum_{k=1}^{n} b_k \psi_k, \tag{12.26} $$

with M independent of n. Orthonormality means

$$ (\varphi_i, \varphi_k) = \delta_{ik} = \text{Kronecker's symbol}, \tag{12.27} $$

and completeness requires

$$ f = 0 \quad \text{if } (f, \varphi_k) = 0 \text{ for all } k. \tag{12.28} $$

If the two systems exist, the general definition of Ky, K*y is by

$$ Ky = \sum_{n=1}^{\infty} (y, \varphi_n)\, K\varphi_n, \qquad K^* y = \sum_{n=1}^{\infty} (y, \psi_n)\, K^* \psi_n. \tag{12.29} $$

The definition implies that K is bounded; that is,

$$ \|Ky\| \le M\, \|y\|, \qquad M \text{ independent of } y. \tag{12.30} $$


The smallest possible M is called the norm of K and denoted by ‖K‖.
K and K* have the same norm. The integral equation (12.11), now
written as

$$ Ky = f, \tag{12.31} $$

is treated by means of the following infinite system of linear equations:

$$ \sum_{j=1}^{\infty} K_{ij}\, y_j = f_i, \qquad i = 1, 2, \ldots, \tag{12.32} $$

where

$$ K_{ij} = (K\varphi_j, \psi_i), \qquad f_i = (f, \psi_i), \qquad y_j = (y, \varphi_j). \tag{12.33} $$

Equation (12.31) admits a solution y ∈ H if and only if (12.32) admits a
solution y_1, y_2, ... with a convergent sum

$$ \sum_{j=1}^{\infty} |y_j|^2. \tag{12.34} $$

The numerical treatment of (12.31) can be achieved by solving the
finite system

$$ \sum_{j=1}^{n} K_{ij}\, \eta_j = f_i, \qquad i = 1, 2, \ldots, n, \tag{12.35} $$

for the unknowns η_j and by using

$$ Y_n = \sum_{j=1}^{n} \eta_j \varphi_j \tag{12.36} $$

as an approximation to y. Here is the place to mention that Eq.
(12.7') was originated by Isay [17] and that his way of numerically
solving it is based on (12.35).
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Equation (12.13) for polynomials H(s,t) has been treated in similar 
vein. To this end, (12.13) is transformed by the substitution s = cos s′,
t = cos t′, and Fourier expansions are used in order to construct a
solution. For details the reader is referred to Reissner [33] and
Weissinger [50].

It is not necessary to work with orthonormal systems of functions φ_i.
Sometimes it is quite useful to select a set of functions u_1, u_2, ... for
which the operations Ku_k can be carried out in closed form. One
approximates f by a linear combination

$$ f \approx \sum_{k=1}^{n} c_k v_k, \qquad v_k = K u_k, \tag{12.37} $$

and takes Σ_{k=1}^{n} c_k u_k as an approximation to y.


12.3. Theorems and Formulas on Fredholm Integral 
Equations 

In the case of Fredholm integral equations the operations (12.22) are
Lebesgue integrals; with respect to the Hilbert space H, as introduced
in the preceding section, the operations Ky, K*y satisfy (12.23), (12.24),
(12.30). In addition, they are completely continuous, which means
that K, for example, transforms any infinite sequence f_1, f_2, ..., f_n, ...
of elements of H with uniformly bounded norms ‖f_n‖ into a sequence
Kf_n = g_n such that a convergent subsequence g_{n_k} exists; that is, an
element g of H exists such that

$$ \lim_{k \to \infty} \|g - g_{n_k}\| = 0. \tag{12.38} $$

The theory of completely continuous operators is well developed; we 
refer the reader to the excellent book by Riesz and Sz.-Nagy [35]. We 
quote some of the already classical results on Fredholm integral 
equations (12.10), and from here on we sometimes write these equations
in the form

$$ (I - \lambda K)\, y = f, \tag{12.39} $$

where I stands for the unit operator Iy = y.


(a) Fredholm’s Alternative 


For a given λ, either the equations (I − λK)y = f and (I − λ̄K*)z = g
have unique solutions y, z, or the homogeneous equations (f = 0, g = 0)
have nontrivial solutions. In the latter case the number of linearly
independent solutions is finite and the same for both equations. If the
homogeneous equations have nontrivial solutions, the inhomogeneous
equations admit solutions for special elements f, g only. Necessary and
sufficient for the existence of a solution is that f be orthogonal to all
solutions of (I − λ̄K*)z = 0 and that g be orthogonal to all solutions of
(I − λK)y = 0.


(b) Eigenvalues, Eigenelements


Nontrivial solutions of (I − λK)y = 0 exist for an at most enumerable
set of values λ_1, λ_2, ... of the parameter λ. These cannot accumulate
to a finite point of the complex λ plane. The values λ_i are called the
eigenvalues of the operator K, and the corresponding nontrivial solutions
of the homogeneous equation are known as eigenelements or
eigenfunctions. If λ is an eigenvalue of K, λ̄ is an eigenvalue of K*
and vice versa. The Volterra integral equation (12.10') has no eigenvalues.


(c) The Operator K(λ) and the Neumann-Liouville Series


The product KL of two integral operators K, L is defined by (KL)y =
K(Ly); in the case L = K one writes KK = K^2, KK^2 = K^3, etc. The
product of completely continuous integral operators is also a completely
continuous integral operator. If λ is not an eigenvalue of K, the solution
to (12.39) can be written as

$$ y = [I + \lambda K(\lambda)]\, f, \tag{12.40} $$

with K(λ) standing for a completely continuous integral operator. We
have K(0) = K. With reference to another value μ of the parameter
which is not an eigenvalue, the relation

$$ K(\lambda) - K(\mu) = (\lambda - \mu)\, K(\lambda) K(\mu) = (\lambda - \mu)\, K(\mu) K(\lambda) \tag{12.41} $$

holds. For sufficiently small values of |λ|, the operator K(λ) admits
the representation by the so-called Neumann-Liouville series

$$ K(\lambda) = \sum_{n=1}^{\infty} \lambda^{n-1} K^n, \tag{12.42} $$

which converges in the sense ‖K(λ) − Σ_{n=1}^{m} λ^{n-1} K^n‖ → 0 when m → ∞.

A sufficient condition of convergence is ‖λK‖ < 1; more precisely,
convergence takes place for |λ| < |λ_1|, where λ_1 denotes an eigenvalue
with smallest modulus. For 0 < r < |λ_1| a number R independent
of n exists such that

$$ \|K^n\| \le R\, r^{-n}; \tag{12.43} $$

for continuous kernels K(s,t), the kernel K_n(s,t) of K^n satisfies

$$ |K_n(s,t)| \le R'\, r^{-n}, \tag{12.43'} $$

with a number R′ not depending on s, t and n (see [4], [5]). Equations
(12.43) and (12.43') can be used as a basis for the development of
iterative methods.

(d) Operators of Finite Rank 


Kernels of the form

$$ K(s,t) = \sum_{k=1}^{n} u_k(s)\, \overline{v_k(t)} \tag{12.44} $$

are called degenerate or of finite rank. There is no lack of generality in
assuming that the functions u_1, u_2, ..., u_n, as well as v_1, v_2, ..., v_n, are
linearly independent. Solutions of (12.39) with (12.44) have the form
y = f + c_1 u_1 + c_2 u_2 + ··· + c_n u_n, with constant coefficients c_k. These
are determined by the following set of linear equations:

$$ c_i - \lambda \sum_{k=1}^{n} c_k\, (u_k, v_i) = \lambda\, (f, v_i), \qquad i = 1, 2, \ldots, n. \tag{12.45} $$

This problem is in the realm of matrix algebra.
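
The reduction of a finite-rank kernel to a linear system can be sketched in a few lines. The following is a minimal illustration, not taken from the text: it assumes real-valued functions (so the scalar product is a plain integral of the product), evaluates the scalar products by Simpson's rule, and uses the hypothetical example K(s,t) = 1 + st on (0,1) with f(s) = s and λ = 1/2. The helper names `simpson`, `solve`, and `solve_degenerate` are illustrative only.

```python
def simpson(g, a, b, m=50):
    """Composite Simpson rule on [a, b] with 2*m subintervals."""
    h = (b - a) / (2 * m)
    total = g(a) + g(b)
    for i in range(1, 2 * m):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3


def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def solve_degenerate(us, vs, f, lam, a, b):
    """Solve y = lam*K y + f for K(s,t) = sum_k us[k](s) * vs[k](t), real case."""
    n = len(us)
    inner = lambda p, q: simpson(lambda t: p(t) * q(t), a, b)
    # Linear system: c_i - lam * sum_k c_k (u_k, v_i) = lam * (f, v_i)
    A = [[(1.0 if i == k else 0.0) - lam * inner(us[k], vs[i]) for k in range(n)]
         for i in range(n)]
    rhs = [lam * inner(f, vs[i]) for i in range(n)]
    c = solve(A, rhs)
    y = lambda s: f(s) + sum(ck * uk(s) for ck, uk in zip(c, us))
    return c, y


# Hypothetical example: K(s,t) = 1 + s*t on (0,1), f(s) = s, lam = 0.5
us = [lambda s: 1.0, lambda s: s]
vs = [lambda t: 1.0, lambda t: t]
f = lambda s: s
lam = 0.5
coeffs, y = solve_degenerate(us, vs, f, lam, 0.0, 1.0)
```

For this example the coefficients work out by hand to c_1 = 12/17 and c_2 = 7/17, and substituting y back into the integral equation gives a residual at rounding level, because Simpson's rule is exact for the quadratic integrands involved.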


(e) Approximation of Operators 


Any completely continuous integral operator K can be approximated
by a suitable operator L of finite rank such that the norm ‖K − L‖ is
below a prescribed quantity ε > 0. This approximation is of practical
importance; it provides the basis for designing numerical methods.
One of these methods is as follows.


(f) The Procedure by Erhard Schmidt

Let L be of finite rank and such that ‖λM‖ < 1 with M = K − L.
Equation (12.39) can be reduced to

$$ y = [I + \lambda M(\lambda)]\, f + N y, \qquad N = \lambda\, [I + \lambda M(\lambda)]\, L. \tag{12.46} $$

M(λ) can be represented by the Neumann-Liouville series; the operator
N is of finite rank; hence solving (12.46) amounts to solving a system
(12.45).

Note. G. Kron's widely publicized method of network tearing is in
essence a special application of Schmidt's procedure. Let us write

$$ A = I - \lambda K, \qquad B = I - \lambda M, \qquad B^{-1} = I + \lambda M(\lambda), \qquad \lambda L = B - A, $$

and

$$ Ay = f, \qquad By = f + (B - A)\, y, \qquad y = B^{-1} f + B^{-1} (B - A)\, y; $$

the last relation is identical with (12.46). The given problem requires
the inversion of the operator A in order to find y = A^{-1}f; Schmidt
selects an operator B which offers two advantages, namely:

1. The inversion of B is easier than the inversion of A in the sense
that B^{-1} admits a convergent Neumann-Liouville series.

2. The operator B^{-1}(B − A) is degenerate.

In the method of network tearing, the operators A, B are nonsingular
n × n matrices. The matrix B is more easily inverted than A; the
matrix B^{-1}(B − A) = C has a rank r smaller than n. This is in
accordance with advantages 1 and 2. In order to make use of r < n,
one may set C = ST, where S has r columns and n rows and T has r
rows and n columns. The solution of the auxiliary system y = g + Cy
with g = B^{-1}f is achieved by introducing the column vector z = Ty
of r components, by solving z = Tg + TSz for z, and finally by y = g
+ Sz. Actually, the preceding steps have nothing to do with networks;
what makes certain linear networks a specialty is the fact that the
selection of B is easy and that the split C = ST is also found in an easy
way.

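
The low-rank step of the tearing method can be sketched as follows. This is a minimal illustration under stated assumptions: the matrices S, T and the vector g are small hypothetical data (n = 3, r = 1), the vector g = B^{-1}f is taken as already computed, and the helper names `matvec`, `matmul`, `solve`, and `torn_solve` are not from the text.

```python
def matvec(M, x):
    """Matrix-vector product for nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def torn_solve(S, T, g):
    """Solve y = g + S T y through the r-dimensional system z = T g + T S z."""
    r = len(T)
    TS = matmul(T, S)  # r x r matrix
    small = [[(1.0 if i == j else 0.0) - TS[i][j] for j in range(r)] for i in range(r)]
    z = solve(small, matvec(T, g))
    return [gi + si for gi, si in zip(g, matvec(S, z))]


# Hypothetical data: n = 3, r = 1
S = [[0.5], [0.0], [0.25]]   # n rows, r columns
T = [[1.0, 1.0, 1.0]]        # r rows, n columns
g = [1.0, 2.0, 3.0]
y = torn_solve(S, T, g)
```

Only an r × r system is solved, which is the point of the split C = ST. For the data above a hand calculation gives z = 24 and y = (13, 2, 9).
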
(g) Splits of K and of K(A) 


With reference to any eigenvalue of K, say λ_1, K can be split into
two operators A, B such that

$$ K = A + B, \qquad AB = BA = O, \tag{12.47} $$

where A is of finite rank with λ_1 as the only eigenvalue, whereas B has
the same eigenvalues as K but not λ_1. The split implies

$$ K(\lambda) = A(\lambda) + B(\lambda), \qquad A(\lambda) B(\lambda) = B(\lambda) A(\lambda) = O; \tag{12.48} $$

B(λ_1) exists; A(λ) has the form

$$ A(\lambda) = \sum_{i=1}^{r} A_i\, (\lambda - \lambda_1)^{-i}, \tag{12.49} $$

where A_1, A_2, ..., A_r are integral operators of finite rank. The
operator A = A(0) can be given the special form

$$ A u = \frac{1}{\lambda_1} \sum_{i,k} a_{ik}\, (u, \psi_k)\, \varphi_i, \tag{12.50} $$

where the constant coefficients a_ik form a matrix of Jordan canonical
form and the elements φ_i, ψ_k of H form a biorthonormal system; that is,
(φ_i, ψ_k) = δ_ik.

(h) Operator Polynomials

Let f(x) = f_0 + f_1 x + ··· + f_n x^n be any polynomial. With respect
to the operator K, we define f(K) = f_0 I + f_1 K + ··· + f_n K^n as
another operator. It is easily seen that f(x) + g(x) = h(x) leads to
f(K) + g(K) = h(K) for any two polynomials f, g. Likewise, f(x)g(x)
= h(x) results in f(K)g(K) = h(K).
If λ is an eigenvalue of K, we also write κ = 1/λ; κ will be called a
characteristic value of K. We rewrite the eigenvalue problem in the
form Ky = κy. This form makes it possible to cover the case κ = 0.
Ky = 0 can admit a finite or an infinite number of linearly independent
solutions y. It follows from Ky = κy that f(K)y = f(κ)y; if, vice versa,

$$ f(K)\, z = \mu z, \qquad z \ne 0, \tag{12.51} $$

then

$$ \mu = f(\kappa), \tag{12.52} $$

with κ as a suitable characteristic value of K.


12.4 Hermitian Operators 


If K = K*, the operator K is called hermitian. K has at least one
eigenvalue; all eigenvalues are real. Eigenelements of different
eigenvalues are orthogonal to one another. One finds r = 1 in
(12.49). Let us order the positive and negative characteristic values
of K in sequences

$$ \kappa_1, \kappa_2, \ldots, \kappa_n, \ldots; \qquad \kappa_{-1}, \kappa_{-2}, \ldots, \kappa_{-n}, \ldots $$

of decreasing modulus such that κ_i stands as often in the sequence as
there are linearly independent eigenelements. The eigenelements
shall be y_1, y_2, ..., y_n, ... and y_{-1}, y_{-2}, ..., y_{-n}, ..., respectively.
Without loss of generality it can be assumed that all eigenelements form
an orthonormal system. The set of eigenelements of characteristic
values different from zero is complete if Ky = 0 has the trivial solution
only. For any positive integer n and any element u of H, Hilbert's
fundamental formula

$$ (K^n u, u) = \sum_k |(u, y_k)|^2\, \kappa_k^n \tag{12.53} $$

holds; this implies, for n = 1,

$$ \kappa_1 = \max\, (Ku,u) \ \text{ with } \|u\| = 1; \qquad \kappa_{-1} = \min\, (Ku,u) \ \text{ with } \|u\| = 1. \tag{12.54} $$


Maximum and minimum are attained for eigenfunctions of κ_1 and
κ_{-1}, respectively. The other values κ_i can be characterized as solutions
of the following minimax problem (Courant [12]). Let v_1,
v_2, ..., v_n be elements of H and confine u to those elements which satisfy

$$ \|u\| = 1 \quad \text{and} \quad (u, v_i) = 0, \qquad i = 1, 2, \ldots, n. $$

By variation of u, we define

$$ F_+(v_1, v_2, \ldots, v_n) = \sup\, (Ku,u), \qquad F_-(v_1, v_2, \ldots, v_n) = \sup\, [-(Ku,u)] $$

as functionals of the v_i; these functionals give rise to

$$ \kappa_{n+1} = \min_{v_i} F_+, \qquad \kappa_{-n-1} = -\min_{v_i} F_-. \tag{12.55} $$


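The kernel K(s,t;λ) of K(λ) admits the expansion

$$ K(s,t;\lambda) = K(s,t) + \lambda \sum_k \frac{y_k(s)\, \overline{y_k(t)}}{\lambda_k (\lambda_k - \lambda)}, \qquad \lambda_k = 1/\kappa_k. \tag{12.56} $$

The split (12.47) with the characteristic value μ results in

$$ A(s,t) = \mu \sum_k y_k(s)\, \overline{y_k(t)}, \tag{12.57} $$

the sum extending over the eigenelements that belong to μ.
Special hermitian operators are those with a real kernel K(s,t) = K(t,s).
Their eigenfunctions can be taken as real functions. In some important
applications one meets kernels L(s,t) = p(t)K(s,t), where K is real and
symmetric in s and t, the function p being nonnegative. The operation
Ly can be linked to a hermitian one by means of

$$ \sqrt{p(s)} \int L(s,t)\, y(t)\, dt = \int \sqrt{p(s)\, p(t)}\, K(s,t)\, z(t)\, dt, \qquad z = \sqrt{p}\; y. \tag{12.58} $$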


12.5 Inclusion Theorems for Hermitian Operators 


Theorem 12.1. For U = (Ku,u) with ‖u‖ = 1, each of the real x sets
x − U ≥ 0, x − U ≤ 0 contains at least one characteristic value of K. If
U is not a characteristic value, then each of the sets x − U > 0, x − U < 0
contains at least one characteristic value of K.

Because of the importance of this theorem for practical applications,
we shall prove it.

Proof. The theorem is trivial for (K − UI)u = 0; if (K − UI)u ≠ 0,
we distinguish the following subcases:

(i) K has eigenvalues of both signs; then κ_1 > U > κ_{-1}, from (12.54),
and the theorem is already proved.

(ii) K has eigenvalues of one sign only, say positive ones. (K − UI)u
≠ 0 implies Ku ≠ 0 and (Ku,Ku) = (K^2 u,u) > 0. It follows from
(12.53), n = 2, that (u, y_k) ≠ 0 for at least one index k, and therefore
U > 0. We find that x − U > 0 contains κ_1; if there is an infinite
number of characteristic values of K, they accumulate at zero, and
x − U < 0 contains an infinite number of them; if there is a finite
number of characteristic values of K only, then κ = 0 must be one of
them, and this one belongs to x − U < 0. This completes the proof.
In the case of K having negative eigenvalues only, the reasoning is
analogous.

With respect to any element u ∈ H with the norm ‖u‖ = 1, we introduce
the so-called Schwarz constants

$$ a_0 = (u,u) = 1, \qquad a_n = (K^n u, u), \qquad n \ge 1. \tag{12.59} $$

These constants are real. We take a system of real coefficients f_0,
f_1, ..., f_n such that

$$ \sum_{k=0}^{n} f_k a_k = 0 \tag{12.60} $$

and set up the polynomial f(x) = f_0 + f_1 x + ··· + f_n x^n; with respect
to f(x), we may state the following theorem.

Theorem 12.2. The polynomial f(x) has at least one real zero. Each
of the real x sets f(x) ≥ 0, f(x) ≤ 0 contains at least one characteristic value
of K; if f(K)u ≠ 0, nonempty sets f(x) > 0, f(x) < 0 exist; each of them
contains at least one characteristic value of K.

Proof. If f(K)u = 0, it follows from what has been stated about
(12.52) that a characteristic value κ of K exists such that f(κ) = 0. In
this case the theorem is trivial. If f(K)u ≠ 0, we apply Theorem 12.1
to the hermitian integral operator L = g(K), where g(x) = f(x) − f_0.
We find (Lu,u) = Σ_{k=1}^{n} a_k f_k = −f_0; hence each of the real y sets y + f_0 > 0,
y + f_0 < 0 contains at least one characteristic value of L. The characteristic
values of L have the form g(κ), where κ runs through the characteristic
values of K. There exist two characteristic values κ′ and κ″ of
K such that g(κ′) + f_0 = f(κ′) > 0; g(κ″) + f_0 = f(κ″) < 0 (Q.E.D.).
In what follows, any polynomial f(x) with real coefficients, satisfying
(12.60), will be called an inclusion polynomial. The condition (12.60)
can be written as

$$ (f(K)u, u) = 0. \tag{12.60'} $$


If f(K)u ≠ 0, we speak of a proper inclusion polynomial. If f(K)u = 0,
we call f(x) an improper inclusion polynomial. If f(x) and g(x) are
inclusion polynomials with respect to u, then af(x) + bg(x), with any
two real coefficients a, b, is an inclusion polynomial. If no improper
polynomial f(x) ≢ 0 exists, we say that u has infinite degree with
respect to K. If improper polynomials f(x) ≢ 0 exist, there is one of
minimum degree m; disregarding a constant factor, this "minimum
polynomial" is unique. We call m the degree of u with respect to K.

Let us assume that the Schwarz constants a_0, a_1, ..., a_{2n-2} admit a
nontrivial solution of the following equations:

$$ a_k f_0 + a_{k+1} f_1 + \cdots + a_{k+n-1} f_{n-1} = 0, \qquad k = 0, 1, \ldots, n - 1. \tag{12.61} $$

Then both the polynomial F_{n-1} = f_0 + f_1 x + ··· + f_{n-1} x^{n-1} and the
products x^k F_{n-1}, with k = 1, 2, ..., n − 1, are inclusion polynomials.
Any linear combination of them with real combination coefficients is an
inclusion polynomial; hence F_{n-1}^2 is an inclusion polynomial. But

$$ (F_{n-1}^2(K)u, u) = \|F_{n-1}(K)u\|^2 = 0 \quad \text{and therefore} \quad F_{n-1}(K)u = 0. \tag{12.62} $$

The case (12.61) then implies that u has a degree less than n. From
now on, we assume u to have degree m ≥ n. Then the polynomial

$$ \varphi_n(x) = \begin{vmatrix} 1 & x & \cdots & x^n \\ a_0 & a_1 & \cdots & a_n \\ a_1 & a_2 & \cdots & a_{n+1} \\ \vdots & \vdots & & \vdots \\ a_{n-1} & a_n & \cdots & a_{2n-1} \end{vmatrix} \tag{12.63} $$

has proper degree n. Obviously φ_n, x^k φ_n, with k ≤ n − 1, are inclusion
polynomials. Any polynomial g(x) with real coefficients and with
degree less than n gives rise to an inclusion polynomial g(x)φ_n. The
polynomials φ_0, φ_1, ... are orthogonal (see Szegő [45]), and they have
real and distinct zeros only. If φ_n is a proper inclusion polynomial with
zeros x_1 < x_2 < ··· < x_n, then the intervals (−∞,x_1), (x_1,x_2), ...,
(x_n,∞) contain characteristic values of K. This can be proved by
considering the polynomials g(x)φ_n with g as a divisor of φ_n (see [5]).

The preceding results seem to have been originated by Wielandt.
In 1950 the present writer discussed the results with Wielandt, who then
stated that his earlier work on hermitian matrices had led him to the
inclusion polynomials. But the polynomials φ_n also show up in papers
by Lanczos [19, 20, 21]. In 1954 Vorobyov [60] published results
about using the polynomials φ_n for the approximate calculation of the
characteristic values of hermitian operators.


Special inclusion polynomials are of the type

$$ f(x) = f_k x^k + f_{k+1} x^{k+1} + f_{k+2} x^{k+2}. $$

These lead to the following special inclusion theorem.

Theorem 12.3. If k is even, the two numbers

$$ P = \text{real} \quad \text{and} \quad Q = \frac{a_k - P a_{k+1}}{a_{k+1} - P a_{k+2}}, \qquad \text{if } a_{k+1} - P a_{k+2} \ne 0, $$

bracket at least one eigenvalue of K.

A special case of this theorem minimizes |P − Q|; the result is

$$ P, Q = \frac{a_{k+1}}{a_{k+2}} \pm d \quad \text{where} \quad d^2 = \frac{a_k}{a_{k+2}} - \left( \frac{a_{k+1}}{a_{k+2}} \right)^2. \tag{12.64} $$

The analogue of (12.64) for differential operators was found by Weinstein.
Wielandt found the same result for hermitian matrices.
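
The bracket (12.64) is easy to try out on a symmetric matrix standing in for the hermitian operator K. The sketch below is an illustration under stated assumptions, not part of the text: the 2 × 2 matrix and the trial vector u are hypothetical, u is assumed to have norm 1, a_{k+2} is assumed nonzero, and the names `schwarz_constants` and `bracket` are illustrative.

```python
import math


def matvec(K, x):
    """Matrix-vector product for nested lists."""
    return [sum(kij * xj for kij, xj in zip(row, x)) for row in K]


def dot(x, y):
    """Real scalar product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def schwarz_constants(K, u, count):
    """Return a_0, ..., a_{count-1} with a_n = (K^n u, u); u must have norm 1."""
    consts, v = [], u[:]
    for _ in range(count):
        consts.append(dot(v, u))
        v = matvec(K, v)
    return consts


def bracket(K, u, k=0):
    """Numbers P <= Q of (12.64) for an even index k (assumes a_{k+2} != 0)."""
    a = schwarz_constants(K, u, k + 3)
    centre = a[k + 1] / a[k + 2]
    # max(...) only guards against a tiny negative value caused by rounding
    d = math.sqrt(max(0.0, a[k] / a[k + 2] - centre ** 2))
    return centre - d, centre + d


# Hypothetical symmetric matrix and unit trial vector
K = [[2.0, 1.0], [1.0, 3.0]]
u = [1.0, 0.0]
P0, Q0 = bracket(K, u, k=0)
P2, Q2 = bracket(K, u, k=2)
```

For this matrix the characteristic values are κ = (5 ± √5)/2, so the eigenvalues λ = 1/κ are roughly 0.276 and 0.724. With k = 0 the Schwarz constants 1, 2, 5 give the bracket (0.2, 0.6); with k = 2 the constants 5, 15, 50 give the tighter bracket (0.2, 0.4). Both contain the eigenvalue near 0.276.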
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Many more results about the inclusion of characteristic values are 
available. The reader is referred to [5]. 


12.6 The Classical Iteration 


Many iterative methods, which have been designed either to determine
the eigenvalues and eigenelements of an integral operator K or to
approximate the solution or solutions of the equation (I − λK)y = f,
derive a sequence of elements u_n of H from an initial element u_0 by
means of polynomials q_n(x) of degree n, such that

$$ u_n = q_n(K)\, u_0. \tag{12.65} $$

The iteration involves in particular the elements

$$ v_n = K^n u_0, \tag{12.66} $$

which we consider first. The split (12.47) with respect to λ_1 implies

$$ K^n = A^n + B^n. \tag{12.67} $$

Let λ_1 be the only eigenvalue of smallest modulus. Application of
(12.43) with respect to B gives

$$ \lambda_1^n\, (K^n - A^n)\, u_0 \to 0 \quad \text{when } n \to \infty; \tag{12.68} $$

hence in the case Au_0 ≠ 0 the asymptotic behavior of the sequence
(12.66) is displayed by the sequence

$$ w_n = A^n u_0. \tag{12.69} $$

The study of (12.69) is in the realm of matrix iteration because of the
finite rank of A [see (12.50)]. The minimum polynomial of the
matrix (a_ik) of coefficients in (12.50) implies a corresponding relation


(A — Al)’ = 0; (12.70) 
consequently > (;) (Aa) =A), (12.713 
K=0 


This relation holds approximately when v_n is substituted for w_n; the
error of the approximation goes to zero when n → ∞. This provides
the basis for finding information about A by means of the sequence
(12.66). In the important case r = 1 the functions v_n, v_{n+1} will
become proportional to each other as n → ∞. The proportionality
constant is λ_1. So far as the inhomogeneous equation (I − λK)y = f
for |λ| < |λ_1| is concerned, it can be solved by the iteration

$$ u_{n+1} = \lambda K u_n + f. \tag{12.72} $$

The error is

$$ u_{n+1} - y = \lambda K (u_n - y) = (\lambda K)^{n+1} (u_0 - y); \tag{12.73} $$

Eqs. (12.68) and (12.69) imply

$$ \|u_n - y\| \to 0 \quad \text{when } n \to \infty. \tag{12.74} $$

This holds in the sense of norm convergence. A stronger result can be
obtained for continuous kernels, where the functions u_n(s) − y(s)
converge uniformly to zero for a ≤ s ≤ b.
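
The iteration u_{n+1} = λKu_n + f can be tried numerically once the integral is replaced by a quadrature sum. The sketch below is only an illustration under assumptions that are not in the text: the midpoint rule with n equal cells is used for the integral, and the test problem is the hypothetical kernel K(s,t) = st on (0,1) with f(s) = s and λ = 1, whose only eigenvalue is 3 and whose exact solution is y(s) = 1.5 s.

```python
def iterate_fredholm(kernel, f, lam, a, b, n=100, steps=40):
    """Classical iteration u <- lam*K u + f on a midpoint-rule grid, u_0 = f."""
    h = (b - a) / n
    x = [a + (i + 0.5) * h for i in range(n)]          # midpoints of the cells
    fx = [f(s) for s in x]
    Kmat = [[kernel(s, t) for t in x] for s in x]      # kernel values K(x_i, x_k)
    u = fx[:]
    for _ in range(steps):
        u = [lam * h * sum(kij * uj for kij, uj in zip(row, u)) + fi
             for row, fi in zip(Kmat, fx)]
    return x, u


# Hypothetical test problem: K(s,t) = s*t, f(s) = s, lam = 1 (|lam| < 3)
x, u = iterate_fredholm(lambda s, t: s * t, lambda s: s, 1.0, 0.0, 1.0)
max_error = max(abs(ui - 1.5 * xi) for ui, xi in zip(u, x))
```

Each step reduces the iteration error by the factor |λ/λ_1| = 1/3 in this example; what remains after 40 steps is the quadrature error of the midpoint rule, which is of order 10^{-5} for n = 100.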

In the special case u_0 = f the iteration leads to

$$ u_n = \sum_{k=0}^{n} \lambda^k K^k f, \tag{12.75} $$

which means that u_n is a partial sum of the solution y due to (12.40) and
(12.42).

A limit case of the iteration appears for λ = λ_1, with f assumed to
admit a solution of the integral equation. Convergence takes place for
Af = 0 and u_0 = f. The functions u_n converge to the solution of

$$ (I - \lambda_1 B)\, y = f. \tag{12.76} $$


One of the best-known examples is furnished by the limit case / = o 
of the integral equation (12.7), which goes with the kernel 
la 
BES) nm On, 

with r_st denoting the distance of the points s, t and with n_s standing for
the inner normal at point s. All eigenvalues are real; λ_1 = 1 is the
smallest in modulus, and the kernel A(s,t) of A in (12.50) takes the 
form A(s,f) = ,(s). The integral operator is also related to integral 
equations of Dirichlet’s problem for the exterior of Cg as well as of 
conformal mapping of the interior of C, onto the interior of the unit 
circle. The iteration (12.75) for this example has been widely explored, 
in particular by Warschawski [49] and through computational experiments
by Todd and Warschawski [46]. Since the rate of convergence
is |λ_1/λ_2|, where λ_2 is next to λ_1 in modulus, bounds for λ_2 have been
given, especially for nearly circular contours; the papers cited also give
estimates of the error of iteration.

Iteration in the case λ = λ_1 with the kernel (12.77) can also be
carried out in the following way. We modify the integral equation by
writing

$$ y(s) = \int [K(t,s) - 1]\, y(t)\, dt + \Gamma + f(s), \qquad \Gamma = \int y(t)\, dt. \tag{12.78} $$

The new kernel K(t,s) − 1 has lost the eigenvalue λ_1 = 1; it still
has all the other ones of K(t,s). Therefore the method (12.72) will
work; the constant Γ can be fixed arbitrarily. The modification (12.78)


log 7,15 (12.77) 




and other ones have been designed by Wielandt in order to overcome
the restriction |λ| < |λ_1| (see [54, 5]).


12.7 Iteration Polynomials 


We return to the more general sequence (12.65); in particular we
discuss the sequence u_0 = 0,

$$ u_n = q_{n-1}(\lambda K)\, f, \qquad n = 1, 2, \ldots; \quad q_{n-1} = \text{polynomial of degree } n - 1. \tag{12.79} $$

We introduce the error

$$ u_n - y = -p_n(\lambda K)\, y, \qquad p_n(x) = 1 - (1 - x)\, q_{n-1}(x). \tag{12.80} $$

The special case p_n = x^n characterizes the iteration of the preceding
section. A refined iteration is due to Wiarda [53]. He sets

$$ u_{n+1} = \theta u_n + (1 - \theta)\, \lambda K u_n + (1 - \theta)\, f \tag{12.81} $$

with a suitable constant θ. We find

$$ u_{n+1} - y = [\theta + (1 - \theta)\, \lambda K]\, (u_n - y), \tag{12.82} $$

and the polynomial p_n(x) is

$$ p_n(x) = [\theta + (1 - \theta)\, x]^n. \tag{12.83} $$

Wiarda investigated the sequence (12.81) for the case of a real and
symmetric kernel. It was shown by Bückner that the error converges to
zero for any integral operator K and for any f if

$$ |\theta| < 1 \quad \text{and} \quad |\theta + (1 - \theta)\, \lambda / \lambda_\nu| < 1, \qquad \nu = 1, 2, \ldots. \tag{12.84} $$


The proof uses (12.43') as a tool. The more general iteration method

$$ u_{n+1} = [\theta_{n+1} + \lambda (1 - \theta_{n+1}) K]\, u_n + (1 - \theta_{n+1})\, f, \qquad \theta_{n+p} = \theta_n, \tag{12.85} $$

with an integer p ≥ 1, has the polynomial

$$ p_p(x) = \prod_{i=1}^{p} [\theta_i + (1 - \theta_i)\, x]. \tag{12.86} $$

The sequence (12.85) converges for

$$ |p_p(0)| < 1, \qquad |p_p(\lambda / \lambda_k)| < 1, \qquad k = 1, 2, \ldots. \tag{12.87} $$

If λ is not an eigenvalue, parameters θ_1, θ_2, ..., θ_p can always be found
to enforce (12.87). In the case of real eigenvalues the special case

$$ w_n = (1 - \alpha)\, u_n + \alpha \lambda K u_n + \alpha f, \qquad u_{n+1} = (1 + \alpha)\, w_n - \alpha \lambda K w_n - \alpha f \tag{12.88} $$

leads to convergence for

$$ |1 - \alpha^2| < 1, \qquad |1 - \alpha^2 (1 - \lambda / \lambda_k)^2| < 1. \tag{12.89} $$

With a sufficiently small value of |α|, convergence can always be
enforced.
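
The effect of the damping constant in Wiarda's iteration can be seen on a small numerical experiment. The following sketch rests on assumptions that are not in the text: the integral is discretized by the midpoint rule, and the test problem is the hypothetical kernel K(s,t) = −st on (0,1) with f(s) = s and λ = 4. Its only eigenvalue is λ_1 = −3, so |λ| > |λ_1| and the undamped iteration (θ = 0) must diverge, whereas θ = 1/2 gives θ + (1 − θ)λ/λ_1 = −1/6. The exact solution is y(s) = 3s/7.

```python
def wiarda(kernel, f, lam, theta, a=0.0, b=1.0, n=100, steps=60):
    """Iteration u <- theta*u + (1 - theta)*(lam*K u + f) on a midpoint grid."""
    h = (b - a) / n
    x = [a + (i + 0.5) * h for i in range(n)]
    fx = [f(s) for s in x]
    Kmat = [[kernel(s, t) for t in x] for s in x]
    u = fx[:]                                    # starting element u_0 = f
    for _ in range(steps):
        Ku = [h * sum(kij * uj for kij, uj in zip(row, u)) for row in Kmat]
        u = [theta * ui + (1.0 - theta) * (lam * kui + fi)
             for ui, kui, fi in zip(u, Ku, fx)]
    return x, u


# Hypothetical test problem with |lam| > |lam_1|
kernel = lambda s, t: -s * t
f = lambda s: s
x, u_plain = wiarda(kernel, f, lam=4.0, theta=0.0)    # classical iteration
x, u_damped = wiarda(kernel, f, lam=4.0, theta=0.5)   # damped iteration
err_plain = max(abs(ui - 3.0 * xi / 7.0) for ui, xi in zip(u_plain, x))
err_damped = max(abs(ui - 3.0 * xi / 7.0) for ui, xi in zip(u_damped, x))
```

With θ = 0 the error grows by the factor 4/3 per step, while with θ = 1/2 the iteration settles on the discrete solution and only the quadrature error remains.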

All these methods have been extended by Schönberg [38, 39] to
bounded operators K in Hilbert and Banach spaces. He also found that
an expansion of (1 − z)^{-1} in polynomials of z, which converges
uniformly in a simply connected domain of the complex z plane such
that the values λ/λ_i are in the domain and the point z = 1 outside,
implies a convergent expansion of (I − λK)^{-1} = I + λK(λ) when λK is
substituted for z. In the same vein, Bellman [2] has shown that
regular summation methods of the geometric series of (1 − z)^{-1} apply
equally to the Neumann-Liouville series. His publication is concerned
with a hermitian operator.

Schönberg's results provide the basis for very general iterative
methods. One may generalize (12.71) by setting

$$ w_{n+1} - (1 - \theta_{n+1})\, \lambda K w_n = \varepsilon_{n+1} w_n + (\theta_{n+1} - \varepsilon_{n+1})\, w_{n-1} + (1 - \theta_{n+1})\, f \tag{12.90} $$
with suitable coefficients θ_n, ε_n. This iteration appears in the above-mentioned
papers by Lanczos. Stiefel [44] writes L = I − λK, introduces
the residuals r_n = f − Lw_n, and sets

$$ r_n = P_n(L)\, f; \tag{12.91} $$

the polynomial P_n(x) is related to p_n(x) by P_n(x) = p_n(1 − x). Assuming
that the characteristic values of L are real, positive, and within an
interval (a′,b′), he recommends using the polynomials

$$ P_n(x) = \frac{\cos nt}{\cosh n\tau}, \qquad \cos t = \frac{b' + a' - 2x}{b' - a'}, \qquad \cosh \tau = \frac{b' + a'}{b' - a'}. \tag{12.92} $$

They have the property P_n(0) = 1; among all polynomials of this
property, the polynomials (12.92) have the least maximum absolute
value in the interval (a′,b′), which makes them Chebyshev polynomials
in a generalized sense. Therefore the choice (12.92) admits an optimum
liquidation of the residuals. The reduction of the residuals goes
with the order P_n ~ 2e^{-nτ} for large values of n. The coefficients of
(12.90) can be found from the recurrence relation of the polynomials
P_n. Stiefel discusses in detail some numerical aspects of the method
with regard to the kernel (12.77).
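
The two stated properties of these residual polynomials, the normalization at x = 0 and the reduction of order 2e^{-nτ}, can be checked numerically. The sketch below assumes that the polynomial is the scaled Chebyshev polynomial T_n((b′ + a′ − 2x)/(b′ − a′)) / T_n((b′ + a′)/(b′ − a′)), which agrees with cos nt / cosh nτ through T_n(cos t) = cos nt and T_n(cosh τ) = cosh nτ. The interval (1, 4) and the degree 10 are hypothetical choices, and the names `cheb_T` and `residual_poly` are illustrative.

```python
import math


def cheb_T(n, z):
    """Chebyshev polynomial T_n(z) by the three-term recurrence (any real z)."""
    t_prev, t = 1.0, z
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * z * t - t_prev
    return t


def residual_poly(n, x, lo, hi):
    """P_n(x) = T_n((hi + lo - 2x)/(hi - lo)) / T_n((hi + lo)/(hi - lo))."""
    return cheb_T(n, (hi + lo - 2.0 * x) / (hi - lo)) / cheb_T(n, (hi + lo) / (hi - lo))


# Hypothetical interval (a', b') = (1, 4) and degree n = 10
lo, hi, n = 1.0, 4.0, 10
tau = math.acosh((hi + lo) / (hi - lo))
grid = [lo + (hi - lo) * i / 1000 for i in range(1001)]
peak = max(abs(residual_poly(n, x, lo, hi)) for x in grid)
value_at_zero = residual_poly(n, 0.0, lo, hi)
```

Here `value_at_zero` equals 1 up to rounding, and `peak` agrees with 1/cosh(nτ), which for large n is very nearly 2e^{-nτ}.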
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This is not the end of possible iterative methods. Wagner and 
Samuelson have reported on certain modifications of the preceding 
methods; Lonseth has given another survey of iteration methods, 
together with computational experiments. Weissinger has found a 
general criterion of convergence, related to operator equations defined 
for abstract sets. ‘The quadratically convergent iteration method of 
Schulz, suggested for matrix problems, can be extended to integral 
equations. The reader is referred to the bibliography at the end of this 
chapter. 

The sequence (12.65) can also be used to obtain information about A
in the split (12.47), in particular with respect to eigenvalues other than
λ_1. The polynomials q_n(x) may be so chosen that, with respect to a
certain characteristic value κ of K, q_n(κ) prevails over the values of q_n for
the other characteristic values. This can be done on the basis of a first
guess on κ; if λ_1 is already known, the polynomials will be chosen such
that q_n(κ_1) = 0, and the iteration will lead to information about
another characteristic value. The special methods mentioned above
have been adapted and used to this purpose for both integral and
matrix operators.

In the case of hermitian operators, polynomial iteration for the purpose
of finding eigenelements overlaps with the use of inclusion polynomials.
Some typical results for the asymptotic behavior of functions
of the Schwarz constants a_n may be mentioned. The ratios μ_n =
a_{n-1}/a_n converge to an eigenvalue, say λ_1. The convergence involves
|μ_{2n-1}| ≥ |λ_1|; if K is definite, we even have |μ_n| ≥ |λ_1|. In any
case the inequality |μ_{2n-1}| ≥ |μ_{2n}|, due to Grammel, holds. The
numbers μ_1 μ_2 ≥ μ_3 μ_4 ≥ ··· are upper bounds of λ_1^2, as has been
pointed out by Collatz [11]. Also due to Collatz is the sequence


Q, = Pa (12.93: 
Hy — As 
which for a positive definite K with 4, as the smallest eigenvalue and /: 
as the second smallest eigenvalue, converges to A, from below. ‘This 
can be generalized. The A, in (12.93) can be replaced by R,, provided 
that i i 
n—-1” Fn 
Re Ee ht) 
R,, — En 
(see Biickner [5]). All these and other general results can be easily 
proved by means of the inclusion theorems mentioned above. 
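
The behavior of the ratios of successive Schwarz constants can be observed on a small symmetric matrix used as a stand-in for a hermitian operator. The sketch below is an illustration only: the matrix and the unit vector u are hypothetical, the ratio a_{n-1}/a_n is called `mu` in the code, and the function name `schwarz_ratios` is not from the text.

```python
import math


def matvec(K, x):
    """Matrix-vector product for nested lists."""
    return [sum(kij * xj for kij, xj in zip(row, x)) for row in K]


def dot(x, y):
    """Real scalar product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def schwarz_ratios(K, u, count):
    """Ratios a_{n-1}/a_n, n = 1..count, with a_n = (K^n u, u) and |u| = 1."""
    a, v = [], u[:]
    for _ in range(count + 1):
        a.append(dot(v, u))
        v = matvec(K, v)
    return [a[n - 1] / a[n] for n in range(1, count + 1)]


# Hypothetical positive definite matrix and unit trial vector
K = [[2.0, 1.0], [1.0, 3.0]]
u = [1.0, 0.0]
mu = schwarz_ratios(K, u, 40)                 # mu[0] is the ratio for n = 1
lam1 = 2.0 / (5.0 + math.sqrt(5.0))           # smallest eigenvalue 1/kappa_1
# Products of consecutive pairs of ratios: a_0/a_2, a_2/a_4, ...
products = [mu[2 * j] * mu[2 * j + 1] for j in range(len(mu) // 2)]
```

The first ratios are 0.5, 0.4, 0.333..., 0.3 and they decrease toward λ_1 ≈ 0.2764, in line with the statement for definite operators; the pair products 0.2, 0.1, ... decrease toward λ_1^2 from above.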
Quite another method of calculating the eigenvalues larger than 4, 
in modulus has been suggested by Wielandt [55]. He uses an approxi- 


mation 4* to a higher eigenvalue and iterates by solving the equation 
(I — A*K)w, = w,_1- 


$$ \mu_n < R_n < \lambda_2 \quad \text{as } n \to \infty \tag{12.94} $$

12.8 Analogy Methods


The numerical computation of the integral Ky goes in general with a
rule of numerical quadrature of the type

$$ \int_a^b g(s)\, ds \approx \sum_{k=1}^{n} p_k\, g(x_k) \tag{12.95} $$

with certain weights p_1, p_2, ..., p_n and certain abscissas a ≤ x_1 < x_2 <
··· < x_n ≤ b.

By such a rule the Fredholm integral equation turns approximately
into the following set of simultaneous equations:

$$ \sum_{k=1}^{n} (\delta_{ik} - \Lambda p_k K_{ik})\, F_k = f_i, \qquad K_{ik} = K(x_i, x_k), \qquad i = 1, 2, \ldots, n, \tag{12.96} $$

with the quantities F_k as approximations to y(x_k) and with Λ replacing
the parameter λ. The treatment of (12.96) in the homogeneous case
f_i = 0 will furnish approximations to the eigenvalues of K as well as to
the eigenfunctions. In the inhomogeneous case one will obtain an
approximation F_k to y(x_k) of a solution y.
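
The passage from the integral equation to the simultaneous equations can be sketched as follows. The code is a minimal illustration under assumptions that are not in the text: the quadrature rule is the trapezoidal rule on equidistant abscissas, the right-hand sides are taken as f_i = f(x_i), and the test problem is the hypothetical kernel K(s,t) = st on (0,1) with f(s) = e^s − s and parameter 1, for which the exact solution is y(s) = e^s. The names `solve` and `nystrom` are illustrative.

```python
import math


def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a dense system."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def nystrom(kernel, f, lam, a, b, n):
    """Solve sum_k (delta_ik - lam*p_k*K_ik) F_k = f(x_i) with trapezoidal weights."""
    h = (b - a) / (n - 1)
    x = [a + i * h for i in range(n)]
    p = [h] * n
    p[0] = p[-1] = h / 2                      # trapezoidal end weights
    A = [[(1.0 if i == k else 0.0) - lam * p[k] * kernel(x[i], x[k])
          for k in range(n)] for i in range(n)]
    F = solve(A, [f(s) for s in x])
    return x, F


# Hypothetical test problem with exact solution y(s) = exp(s)
x, F = nystrom(lambda s, t: s * t, lambda s: math.exp(s) - s, 1.0, 0.0, 1.0, 101)
max_error = max(abs(Fi - math.exp(xi)) for Fi, xi in zip(F, x))
```

With 101 abscissas the computed values agree with e^s to within about 10^{-4}, which reflects the second-order accuracy of the trapezoidal rule.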

In the case of a real, symmetric, and continuous kernel K(s,t) and
for equidistant abscissas x_k = a + kh, h = (b − a)/n = p_k, Hilbert has
shown that the suitably ordered zeros Λ_i^{(n)} of det (δ_ik − Λ p_k K_ik)
converge toward eigenvalues λ_i of K as n → ∞.

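With the aid of the new abscissas $\xi_0 = a$,

$$\xi_k = a + P_n^{-1}(b - a) \sum_{i=1}^{k} p_i, \qquad P_n = \sum_{i=1}^{n} p_i, \qquad (12.97)$$

and of the orthogonal set of functions

$$u_k(s) = 1 \ \text{for } \xi_{k-1} \le s < \xi_k, \qquad u_k(s) = 0 \ \text{elsewhere}, \qquad (12.98)$$

one may introduce

$$f_n(s) = \sum_{k=1}^{n} f_k u_k(s), \qquad F_n(s) = \sum_{k=1}^{n} F_k u_k(s), \qquad K_n(s,t) = P_n (b - a)^{-1} \sum_{i,k=1}^{n} K_{ik}\,u_i(s)\,u_k(t). \qquad (12.99)$$

The set (12.96) is equivalent to the integral equation

$$F_n(s) - \Lambda \int_a^b K_n(s,t)\,F_n(t)\,dt = f_n(s), \qquad (12.100)$$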


which has a kernel of finite rank. It is natural to assume that

$$r_n(g) = \sum_{i=1}^{n} p_i\, g(x_i) - \int_a^b g(s)\,ds \to 0 \quad \text{as } n \to \infty \qquad (12.101)$$

for continuous functions $g(s)$ if the quadrature rule is defined for all $n$.
It has been proved [5, 7] that

$$\lim_{n \to \infty} K_n(s,t) = K(s,t), \qquad \lim_{n \to \infty} f_n(s) = f(s) \qquad (12.102)$$

in the sense of uniform convergence if $f(s)$ and $K(s,t)$ are continuous.
It has also been shown, under the same assumptions on $f$ and $K$, that
$F_n(s)$ converges uniformly to the solution $y$ of $(I - \lambda K)y = f$ if $\lambda$ is
not an eigenvalue of $K$. Furthermore, the suitably ordered eigenvalues
of $K_n(s,t)$ converge to the eigenvalues of $K$, and a similar result holds for
the eigenfunctions of $K_n(s,t)$. The rate of convergence of the approximation
depends on the rate of convergence of (12.101).

The choice of the rule (12.95) depends essentially on features of 
regularity of f(s) and of K(s,t). If both functions are continuous, but 
do not possess continuous derivatives, it does not help much to apply 
any rule with higher precision than the one investigated by Hilbert. 
If, on the other hand, f(s) and K(s,t) are analytic with respect to their 
variables, then higher-precision rules should be employed. Nyström
has reported on the application of the Gaussian rule to an integral
equation with the kernel (12.77) related to an ellipse as a contour. For
the same kernel Weddle's rule has been found satisfactory by Todd and
Warschawski [46]. Prager used a five-ordinate Chebyshev rule in the
same case.

Nyström [28-31] suggested adapting the rule to the kernel $K(s,t)$.
More precisely, for given abscissas $x_i$, Nyström calculates the integrals


$$\int_a^b K(x_i,t)\,t^m\,dt, \qquad m = 0, 1, \ldots, n - 1.$$


The function $y(t)$ is approximated by its interpolation polynomial with
ordinates $y(x_i)$ at the prescribed abscissas. Fox and Goodwin [15] use
the trapezoidal rule, solve (12.96), then calculate an error term of the 
rule from this first approximation, and throw the error back into the 
system in order to calculate a correction to the first approximation. 

Volterra equations have been treated in a similar way. Huber [16] 
transforms the Volterra equation by approximating y(t) by a polygon 
function; the abscissas are equidistant. Wagner [47, 48] uses parabolic 
interpolation with respect to equidistant abscissas. 

In certain cases, $K(s,t)$ has continuous partial derivatives in $t$ up to a
certain order $r$, whereas the derivative of order $(r + 1)$ jumps at $t = s$.
This happens to certain homogeneous integral equations which represent
vibration problems of one space dimension. The kernel $K$ is
related to or identical with the Green function of an ordinary differential
equation under boundary conditions. A higher-precision integration
rule for functions $g(t)$, which have continuous derivatives up to
order $(r + 1)$, can be used if the system (12.96) is modified in the form

$$\sum_{k=1}^{n} [(1 + \Lambda d_i)\,\delta_{ik} - \Lambda p_k K_{ik}]\,F_k = 0. \qquad (12.96')$$

The term $d_i$ depends on the jump of $\partial^{r+1}K/\partial t^{r+1}$.
In the case of hermitian operators $K$, error bounds of the method
(12.96) for the approximations to the characteristic values have been
found by Wielandt [59]. He assumes $\sum_{i=1}^{n} p_i = b - a$ and an ordering
of the characteristic values of $K$ and $K_n$ in the sense


$$\kappa_1 \ge \kappa_2 \ge \cdots \ge 0, \qquad \kappa_{-1} \le \kappa_{-2} \le \cdots \le 0;$$
$$\kappa_1' \ge \kappa_2' \ge \cdots \ge 0, \qquad \kappa_{-1}' \le \kappa_{-2}' \le \cdots \le 0;$$

respectively. The error bounds are of the type $|\kappa_i - \kappa_i'| \le M$, with
$M$ independent of $i$. The basic approach to these results uses the new
concept of an hermitian kernel $G(s,t)$ allowing an integration rule $S$. This
means that the approximation by means of (12.96) gives the correct
characteristic values of $G(s,t)$. It is proved that a rule $S$, applied to $K$,
admits the error bound

$$|\kappa_i - \kappa_i'| \le \|K - G\|, \qquad (12.103)$$


where $G$ can be any hermitian operator allowing $S$. In practical applications,
$G$ is constructed as an operator of finite rank, and an upper
bound $M$ for $\|K - G\|$ is established. A typical example refers to the
case of equal weights and equidistant abscissas, which we mentioned
before. Assuming a Lipschitz condition $|K_{ik} - K(s,t)| \le L[\rho(s) + \rho(t)]$,
with $\rho(s) = s - x_i$ for $x_i \le s < x_{i+1}$, Wielandt finds


Ike — Ki] <CLIn, = C=KB4V%. (12.104) 


This is also the best result for a given Lipschitz constant L; that is, 
C cannot be replaced by a smaller constant. Other examples refer 
to the trapezoidal rule, the central-point rule, the Simpson rule, and 
the Gaussian method. 


12.9 Approximation of y by a Linear Combination of 
Given Functions 


The method described in Sec. 12.2 by (12.35), (12.36) can also be
applied to Fredholm integral equations. So far, it has been customary
to work with only one complete system $u_1, u_2, \ldots$ of linearly independent
functions of $H$. One may approximate a solution $y$ of either the
homogeneous or inhomogeneous integral equation by a linear combination

$$v = c_1 u_1 + c_2 u_2 + \cdots + c_n u_n. \qquad (12.105)$$
For the integral equation $(I - \lambda K)y = f$, the method leads to the
residual

$$r_n = f - (I - \lambda K)v = f - \sum_{k=1}^{n} c_k (I - \lambda K)u_k. \qquad (12.106)$$

The coefficients $c_k$ can be determined from $n$ conditions to be satisfied
by $r_n$. Dealing with a continuous kernel and continuous functions
$u_k$, one could apply the so-called method of collocation, which requires
that the function $r_n$ vanish at $n$ prescribed abscissas. Another approach
requires

$$(u_i, r_n) = 0 \quad \text{for } i = 1, 2, \ldots, n. \qquad (12.107)$$

This leads to the equations

$$\sum_{k=1}^{n} [(u_i, u_k) - \lambda (u_i, K u_k)]\,c_k = (u_i, f), \qquad i = 1, 2, \ldots, n. \qquad (12.108)$$

Equation (12.108) can be used for both the inhomogeneous equation
$(I - \lambda K)y = f$ and the eigenvalue problem $(I - \lambda K)y = 0$.
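A small worked case may make the system (12.108) concrete. The sketch below is illustrative only and rests on assumptions not found in the text: real functions on $[0,1]$ with the inner product $(u,v) = \int_0^1 u v\,ds$, the degenerate kernel $K(s,t) = st$, $f(s) = s$, $\lambda = 1$, and the two trial functions $u_1 = 1$, $u_2 = s$. The exact solution $y = \tfrac{3}{2}s$ lies in the span of the trial functions, so the computed coefficients should be $c_1 = 0$, $c_2 = 1.5$. All function names are hypothetical.

```python
def simpson(g, a, b, m=20):
    """Composite Simpson rule on [a, b] with 2*m subintervals."""
    h = (b - a) / (2 * m)
    total = g(a) + g(b)
    for j in range(1, 2 * m):
        total += (4 if j % 2 else 2) * g(a + j * h)
    return total * h / 3.0

def inner(u, v):
    """Real inner product (u, v) on [0, 1]."""
    return simpson(lambda s: u(s) * v(s), 0.0, 1.0)

def apply_kernel(kernel, u):
    """Return the function (K u)(s) = integral of K(s,t) u(t) dt over [0, 1]."""
    return lambda s: simpson(lambda t: kernel(s, t) * u(t), 0.0, 1.0)

def galerkin_two_functions(kernel, f, lam, u1, u2):
    """Solve the 2x2 instance of (12.108) by Cramer's rule."""
    basis = [u1, u2]
    ku = [apply_kernel(kernel, u) for u in basis]
    a = [[inner(basis[i], basis[k]) - lam * inner(basis[i], ku[k])
          for k in range(2)] for i in range(2)]
    r = [inner(basis[i], f) for i in range(2)]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    c1 = (r[0] * a[1][1] - a[0][1] * r[1]) / det
    c2 = (a[0][0] * r[1] - a[1][0] * r[0]) / det
    return c1, c2

if __name__ == "__main__":
    c1, c2 = galerkin_two_functions(lambda s, t: s * t, lambda s: s, 1.0,
                                    lambda s: 1.0, lambda s: s)
    print(c1, c2)
```

Simpson's rule integrates every polynomial that occurs here exactly, so the only error in this instance is rounding; with a general kernel the quadrature error enters the matrix entries as well.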

The selection of the functions $u_1, u_2, \ldots$ is of special importance.
Very often orthonormal systems are used, especially systems of
sine functions or systems of polynomials. In the case of a positive
definite hermitian integral operator, Enskog [13] has suggested solving
the inhomogeneous equation ($\lambda$ not an eigenvalue) for $0 < \lambda < \lambda_1$ by
means of functions $u_i$ which satisfy

$$(u_i, u_k) - \lambda (u_i, K u_k) = \delta_{ik}. \qquad (12.109)$$

In this case the solution of (12.108) is simply $c_i = (u_i, f)$.

For some problems involving the inhomogeneous equation one may
follow the pattern of relaxation. One starts with $u_1$ as a guess for $y$,
calculates $(I - \lambda K)u_1 = w_1$, minimizes the norm of $g = f - c w_1$ by a choice of the
constant $c$, and uses $c u_1$ as first approximation to $y$; thereafter a correction
is determined by treating $(I - \lambda K)z = g$ in the same manner.

It can be shown that the system (12.108) is equivalent to an integral
equation

$$(I - \lambda L)v = f_n, \qquad (12.110)$$

where $f_n$ may be regarded as an approximation to $f$, and $L$ as an approximation
to $K$; $L$ is of finite rank, related to the functions $u_k$. We found
the same phenomenon in the preceding section on the analogy methods
[see Eq. (12.100)]. This brings us to the more general method of
perturbation where the operator $K$ is considered as a function of a
parameter $\epsilon$ such that

$$K(\epsilon) = K_0 + \epsilon K_1 + \epsilon^2 K_2 + \cdots, \qquad (12.111)$$

with operators $K_0, K_1, \ldots$ not depending on $\epsilon$. This covers the case
(12.110) with $\epsilon = 1$, $K_0 = L$, $K_1 = K - L$, $K_2 = 0$, $K_3 = 0$, .... To
solve $(I - \lambda K(\epsilon))y = f$, one sets

$$y = y_0 + \epsilon y_1 + \epsilon^2 y_2 + \cdots, \qquad (12.112)$$

and the functions $y_n$ are determined by the algorithm

$$(I - \lambda K_0)\,y_n = \lambda\,(K_1 y_{n-1} + K_2 y_{n-2} + \cdots + K_n y_0), \quad n \ge 1; \qquad (I - \lambda K_0)\,y_0 = f. \qquad (12.113)$$
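The recursion for the $y_n$ follows from inserting the series for $K(\epsilon)$ and $y$ into $(I - \lambda K(\epsilon))y = f$ and collecting equal powers of $\epsilon$. A finite-dimensional sketch shows the mechanics. It assumes, purely for illustration, 2-by-2 matrices in place of the operators, $K_2 = K_3 = \cdots = 0$, and the particular numbers below; none of these come from the text, and the function names are hypothetical.

```python
def solve2(m, r):
    """Solve a 2x2 system m x = r by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(r[0] * m[1][1] - m[0][1] * r[1]) / det,
            (m[0][0] * r[1] - m[1][0] * r[0]) / det]

def matvec(m, v):
    """Product of a 2x2 matrix with a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1]]

def perturbation_series(k0, k1, f, lam, eps, terms=30):
    """Sum y0 + eps*y1 + eps^2*y2 + ... for K(eps) = K0 + eps*K1.

    Each step solves (I - lam K0) y_n = lam K1 y_{n-1}, starting from
    (I - lam K0) y_0 = f.
    """
    a0 = [[(1.0 if i == j else 0.0) - lam * k0[i][j] for j in range(2)]
          for i in range(2)]
    y_n = solve2(a0, f)
    total = y_n[:]
    for n in range(1, terms):
        y_n = solve2(a0, [lam * c for c in matvec(k1, y_n)])
        total = [t + eps ** n * c for t, c in zip(total, y_n)]
    return total

def direct_solution(k0, k1, f, lam, eps):
    """Solve (I - lam (K0 + eps K1)) y = f without any expansion."""
    full = [[(1.0 if i == j else 0.0) - lam * (k0[i][j] + eps * k1[i][j])
             for j in range(2)] for i in range(2)]
    return solve2(full, f)

if __name__ == "__main__":
    K0 = [[0.5, 0.0], [0.0, 0.25]]
    K1 = [[0.0, 0.1], [0.1, 0.0]]
    print(perturbation_series(K0, K1, [1.0, 1.0], 1.0, 0.5))
    print(direct_solution(K0, K1, [1.0, 1.0], 1.0, 0.5))
```

Only the unperturbed operator $I - \lambda K_0$ has to be inverted, once per term; this is the practical attraction of the method when $K_0$ is of finite rank or otherwise easy to handle.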


The eigenvalue problem $[I - \lambda K(\epsilon)]y = 0$ is treated by starting with a
solution of the eigenvalue problem of $K_0$. Let it be $(\kappa_0 I - K_0)y_0 = 0$.
Then a characteristic value $\kappa(\epsilon)$ and an eigenfunction $y(\epsilon)$ are wanted
which admit the expansions

$$\kappa(\epsilon) = \kappa_0 + \epsilon\kappa_1 + \epsilon^2\kappa_2 + \cdots, \qquad y(\epsilon) = y_0 + \epsilon y_1 + \epsilon^2 y_2 + \cdots. \qquad (12.114)$$

A set of recursion formulas for the determination of $\kappa_n$ and $y_n$ can be
set up. In the case where $K(\epsilon)$ and $K_0, K_1, \ldots$ are hermitian and
where $\kappa_0$ is a simple characteristic value of $K_0$, we assume that $y(\epsilon)$ and
$y_0$ satisfy the following normalizing conditions:


$$\|y(\epsilon)\| = 1, \qquad (y_0, y_n) = \text{real} \quad \text{for } n = 1, 2, \ldots. \qquad (12.115)$$


The operator $K_0$ admits the split (12.48), $K_0 = A + B$ with respect to
$\kappa_0$. Writing
R= (kel = 3) = «5 1A) (12.116) 
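and

$$z_n = K_n y_0 + (K_1 - \kappa_1 I)\,y_{n-1} + \cdots + (K_{n-1} - \kappa_{n-1} I)\,y_1, \qquad n = 1, 2, \ldots, \qquad (12.117)$$

the recursion formulas are

$$\kappa_n = (z_n, y_0), \qquad (12.118)$$

$$y_1 = R z_1,$$

$$y_n = -\tfrac{1}{2}\,[(y_1, y_{n-1}) + (y_2, y_{n-2}) + \cdots + (y_{n-1}, y_1)]\,y_0 + R z_n, \qquad n = 2, 3, \ldots. \qquad (12.119)$$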


These formulas were first presented by Rellich [34], who established 
a general theory of the perturbation of operators in a Hilbert space. His 
fundamental results include the case of multiple characteristic values, of
convergence of the series (12.114), and of error bounds for the remainder
terms of the series, if the calculation is carried to $\kappa_n$, $y_n$ only. His theory
has been simplified by Sz.-Nagy, who also improved the error bounds
[27]. More work on error bounds has been done by Schröder [40].

Although Rellich established the existence of eigenelements according 
to (12.114), he also pointed out that there may be irregular eigenvalues 
of K(e) which cannot be found by (12.114). 
Example. For $\epsilon < 0$, the kernel

$$K(s,t;\epsilon) = \min(s,t) - st/(1 + \epsilon), \qquad a = 0, \quad b = 1, \quad |\epsilon| < 1,$$

has an eigenvalue $\lambda^*(\epsilon)$ satisfying $\tanh\sqrt{-\lambda^*} = -\epsilon\sqrt{-\lambda^*}$. One finds
$\lambda^* \to -\infty$ when $\epsilon \to 0$; neither $\lambda^*$ nor $1/\lambda^*$ is expandable into a power series
in $\epsilon$ converging in some neighborhood of $\epsilon = 0$.
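The behavior of this irregular eigenvalue is easy to observe numerically. The sketch below writes $\mu = \sqrt{-\lambda^*}$ and solves $\tanh\mu = -\epsilon\mu$ for its positive root by bisection; the function name and the sample values of $\epsilon$ are chosen here for illustration.

```python
import math

def irregular_eigenvalue(eps):
    """Return lambda* = -mu**2, where mu > 0 solves tanh(mu) = -eps*mu.

    Valid for -1 < eps < 0. The function g(mu) = tanh(mu) + eps*mu is
    positive just to the right of 0 and negative at mu = -1/eps, so the
    root is bracketed by these two points.
    """
    if not -1.0 < eps < 0.0:
        raise ValueError("expects -1 < eps < 0")
    lo, hi = 1e-6, -1.0 / eps
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.tanh(mid) + eps * mid > 0.0:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    return -mu * mu

if __name__ == "__main__":
    for eps in (-0.5, -0.2, -0.1, -0.05):
        print(eps, irregular_eigenvalue(eps))
```

For small $|\epsilon|$ the root is very nearly $\mu = 1/|\epsilon|$, so $\lambda^*$ behaves like $-1/\epsilon^2$ and runs off to $-\infty$ as $\epsilon \to 0$, in agreement with the statement of the example.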


REFERENCES 


1. H. Bateman, Numerical Solution of Linear Integral Equations, Proc. Roy. Soc.
London. Ser. A, vol. 100, pp. 441-449, 1922.

2. R. Bellman, Note on Summability of Formal Solutions of Linear Integral
Equations, Duke Math. J., vol. 17, pp. 53-55, 1950.

3. L. Berg, Lösungsverfahren für singuläre Integralgleichungen, Math. Nachr.,
vol. 14, pp. 193-212, 1955.

4. H. Bückner, A Special Method of Successive Approximations for Fredholm
Integral Equations, Duke Math. J., vol. 15, pp. 197-206, 1948.

5. H. Bückner, "Die praktische Behandlung von Integralgleichungen," Ergebnisse
der angewandten Mathematik, Bd. 1, Springer-Verlag, Berlin, 1952.

6. H. Bückner, Ein unbeschränkt anwendbares Iterationsverfahren für Fredholmsche
Integralgleichungen, Math. Nachr., vol. 2, pp. 304-313, 1949.

7. H. Bückner, Konvergenzuntersuchungen bei einem algebraischen Verfahren
zur näherungsweisen Lösung von Integralgleichungen, Math. Nachr., vol. 3, pp.
358-372, 1950.

8. G. F. Carrier, On the Determination of the Eigenfunctions of Fredholm Equations,
J. Math. Phys., vol. 27, pp. 82-83, 1948.

9. L. Collatz, Einschliessungssatz für die Eigenwerte von Integralgleichungen,
Math. Z., vol. 47, pp. 395-398, 1941.

10. L. Collatz, "Numerische Behandlung von Differentialgleichungen,"
Springer-Verlag, Berlin, 1951.

11. L. Collatz, Schrittweise Näherungen bei Integralgleichungen und Eigenwertschranken,
Math. Z., vol. 46, pp. 692-708, 1940.

12. R. Courant and D. Hilbert, "Methoden der mathematischen Physik," 2d ed.,
vol. I, pp. 96-133, Springer-Verlag, Berlin, 1931.

13. D. Enskog, Kinetische Theorie der Vorgänge in mässig verdünnten Gasen,
dissertation, Uppsala, Sweden, 1917.

14. W. Feller, On the Integral Equation of Renewal Theory, Ann. Math. Statist.,
vol. 12, pp. 243-267, 1941.

15. L. Fox and E. T. Goodwin, The Numerical Solution of Nonsingular Integral
Equations, Philos. Trans. Roy. Soc. London. Ser. A, vol. 245, no. 902, pp. 501-554.

16. A. Huber, Eine Näherungsmethode zur Auflösung Volterrascher Integralgleichungen,
Monatsh. Math. Phys., vol. 47, pp. 240-246, 1939.

17. W. H. Isay, Beitrag zur Potentialstroemung durch axiale Schaufelgitter, Z.
Angew. Math. Mech., pp. 397-409, 1953.

18. W. M. Kincaid, Numerical Methods for Finding Characteristic Roots and
Vectors of Matrices, Quart. Appl. Math., vol. 5, pp. 320-345, 1947.

19. C. Lanczos, An Iteration Method for the Solution of the Eigenvalue Problem
of Linear Differential and Integral Operators, J. Res. Nat. Bur. Standards, vol. 45,
p. 255, 1950.

20. C. Lanczos, Chebyshev Polynomials in the Solution of Large-scale Linear
Systems, Proc. Assoc. Comput. Mach., p. 124, 1953.




21. C. Lanczos, Solution of Systems of Linear Equations by Minimized Iterations,
J. Res. Nat. Bur. Standards, vol. 49, p. 33, 1952.

22. N. J. Lehmann, Beiträge zur numerischen Lösung linearer Eigenwertprobleme,
Z. Angew. Math. Mech., vol. 29, pp. 341-356, 1949; vol. 30, pp. 1-16, 1950.

23. A. T. Lonseth, Approximate Solutions of Fredholm Type Integral Equations,
Bull. Amer. Math. Soc., vol. 60, pp. 415-430, 1954.

24. F. Lösch, Zur praktischen Berechnung der Eigenwerte linearer Integralgleichungen,
Z. Angew. Math. Mech., vol. 24, pp. 35-41, 1944.

25. N. I. Muskhelishvili, "Singular Integral Equations," P. Noordhoff, N.V.,
Groningen, Netherlands, 1953 (2d Russian ed., Moscow, 1946).

26. N. I. Muskhelishvili, "Some Basic Problems of the Mathematical Theory of
Elasticity," P. Noordhoff, N.V., Groningen, Netherlands, 1953 (1st Russian ed.,
Leningrad, 1933).

27. B. Sz.-Nagy, Perturbations des transformations autoadjointes dans l'espace de
Hilbert, Comment. Math. Helv., vol. 19, pp. 347-366, 1946-1947.

28. E. J. Nyström, Über die praktische Auflösung von linearen Integralgleichungen
mit Anwendungen auf Randwertaufgaben der Potentialtheorie, Soc. Sci. Fenn.
Comment. Phys.-Math., vol. 4, no. 15, 1928.

29. E. J. Nyström, Über die praktische Auflösung von Integralgleichungen,
Soc. Sci. Fenn. Comment. Phys.-Math., vol. 5, no. 5, 1929.

30. E. J. Nyström, Über die praktische Auflösung von Integralgleichungen mit
Anwendungen auf Randwertaufgaben, Acta Math., vol. 54, pp. 185-204, 1930.

31. E. J. Nyström, Zur numerischen Lösung von Randwertaufgaben bei gewöhnlichen
Differentialgleichungen, Acta Math., vol. 76, pp. 157-184, 1944.

32. W. Prager, Die Druckverteilung an Körpern in ebener Potentialströmung,
Phys. Z., vol. 29, pp. 865-869, 1928.

33. E. Reissner, Solution of a Class of Singular Integral Equations, Bull. Amer.
Math. Soc., vol. 51, pp. 920-922, 1945.

34. F. Rellich, Störungstheorie der Spektralzerlegung, Fünf Mitteilungen,
in particular, I. Mitteilung, Math. Ann., vol. 113, pp. 600-619, 1936; IV. Mitteilung,
Math. Ann., vol. 117, pp. 356-382, 1940; V. Mitteilung, Math. Ann., vol. 118,
pp. 462-484, 1942.

35. F. Riesz and B. Sz.-Nagy, "Leçons d'analyse fonctionnelle," Académie des
Sciences de Hongrie, Budapest, 1952.

36. P. A. Samuelson, Rapidly Converging Solutions to Integral Equations, J.
Math. Phys., vol. 31, pp. 276-286, 1953.

37. W. Schmeidler, "Integralgleichungen mit Anwendungen in Physik und
Technik," vol. I, "Lineare Integralgleichungen," vol. XXII, "Mathematik und ihre
Anwendungen in Physik und Technik," Akademische Verlagsgesellschaft, Leipzig,
1950.

38. M. Schönberg, Sur la méthode d'itération de Wiarda et Bückner pour la
résolution de l'équation de Fredholm, I, Acad. Roy. Belg. Bull. Cl. Sci., ser. 37, vol. 5,
pp. 1141-1156, 1951.

39. M. Schönberg, Sur la méthode d'itération de Wiarda et Bückner pour la
résolution de l'équation de Fredholm, II, Acad. Roy. Belg. Bull. Cl. Sci., ser. 38, vol. 5,
pp. 154-167, 1952.

40. J. Schröder, Fehlerabschätzungen zur Störungsrechnung bei linearen
Eigenwertproblemen, dissertation, Hannover, Germany, 1952.

41. G. Schulz, Iterative Berechnung der reziproken Matrix, Z. Angew. Math.
Mech., vol. 13, pp. 57-59, 1933.

42. E. Schwerin, Über Transversalschwingungen von Stäben veränderlichen
Querschnitts, Z. Techn. Phys., vol. 8, pp. 264-271, 1927.




43. H. A. Schwarz, "Gesammelte mathematische Abhandlungen," vol. I, pp.
241-265, Springer-Verlag, Berlin, 1890.

44. E. Stiefel, On Solving Fredholm Integral Equations, J. Soc. Indust. Appl. Math.,
pp. 63-85, 1956.

45. G. Szegő, "Orthogonal Polynomials," American Mathematical Society
Colloquium Publications, vol. 23, p. 26, 1939.

46. J. Todd and S. E. Warschawski, On the Solution of the Lichtenstein-Gershgorin
Integral Equation in Conformal Mapping, II: Computational Experiments,
in J. Todd, ed., "Experiments in the Computation of Conformal Maps," National
Bureau of Standards Applied Mathematics Series, vol. 42, 1955.

47. C. Wagner, On the Solution of Fredholm Integral Equations of Second Kind
by Iteration, J. Math. Phys., vol. 30, pp. 23-30, 1951.

48. C. Wagner, On the Numerical Evaluation of Fredholm Integral Equations
with the Aid of the Liouville-Neumann Series, J. Math. Phys., vol. 30, pp. 232-234,
1952.

49. S. E. Warschawski, On the Solution of the Lichtenstein-Gershgorin Integral
Equation in Conformal Mapping, I: Theory, in J. Todd, ed., "Experiments in the
Computation of Conformal Maps," National Bureau of Standards Applied Mathematics
Series, vol. 42, 1955.

50. J. Weissinger, Ein Satz über Fourierreihen und seine Anwendung auf die
Tragflügeltheorie, Math. Z., vol. 47, pp. 16-33, 1940.

51. J. Weissinger, Zur Theorie und Anwendung des Iterationsverfahrens, Math.
Nachr., vol. 8, pp. 193-212, 1952.

52. E. T. Whittaker, On the Numerical Solution of Integral Equations, Proc. Roy.
Soc. London. Ser. A, vol. 94, pp. 367-383, 1918.

53. G. Wiarda, "Integralgleichungen unter besonderer Berücksichtigung der
Anwendungen," p. 126, Teubner, Leipzig, 1930.

54. H. Wielandt, Das Iterationsverfahren bei nicht selbstadjungierten linearen
Eigenwertaufgaben, Math. Z., vol. 50, pp. 93-143, 1944.

55. H. Wielandt, Bestimmung höherer Eigenwerte durch gebrochene Iteration,
AVA-Bericht B44/J37, 1944 (unpublished).

56. H. Wielandt, Die Einschliessungssätze von Eigenwerten normaler Matrizen,
Math. Ann., vol. 121, pp. 234-241, 1948.

57. H. Wielandt, Ein Ansatz von L. Schwarz zur Lösung singulärer
Integralgleichungen, 1. Art., AVA-Bericht B44/J22, 1944 (unpublished).

58. H. Wielandt, Einschliessung von Eigenwerten Hermitescher Matrizen nach
dem Abschnittsverfahren, Arch. Math., vol. 5, pp. 108-114, 1955.

59. H. Wielandt, Error Bounds for Eigenvalues of Symmetric Integral Equations, in
American Mathematical Society, "Numerical Analysis: Proceedings of Symposia in
Applied Mathematics, Volume VI," J. H. Curtiss, ed., pp. 261-282, McGraw-Hill
Book Company, Inc., New York, 1956.

60. J. V. Vorobyov, Orthogonal Operator Polynomials and Methods to Approximately
Determine the Spectrum of Linear and Bounded Operators, Uspekhi Mat.
Nauk, vol. 9, pp. 83-90, 1954 (in Russian).

61. L. V. Kantorovitch, Functional Analysis and Applied Mathematics, Uspekhi
Mat. Nauk, vol. 3, pp. 89-185, 1948; there is a translation by C. D. Benster, in
G. E. Forsythe, ed., Nat. Bur. Standards Report 1509, 1952.


Recent Literature 


Among the recent books and papers which are particularly relevant are the 
following: 




62. "Symposium on the Numerical Treatment of Ordinary Differential Equations,
Integral and Integro-differential Equations," Proceedings of the Rome Symposium
(20-24 September 1960) Organized by the Provisional International Computation
Center, Birkhäuser Verlag, Basel, Stuttgart, 1960.

63. S. G. Mikhlin, "Integral Equations," Pergamon Press, London, 1957.

64. F. G. Tricomi, "Integral Equations," Interscience Publishers, Inc., New York,
1957.

65. E. Martensen, Berechnung der Druckverteilung an Gitterprofilen in ebener
Potentialströmung mit einer Fredholmschen Integralgleichung, Arch. Rational Mech.
Anal., vol. 3, pp. 235-270, 1959. (This paper is related to the equation (12.7), as is
[32] also.)




13


Errors of Numerical Approximation for Analytic Functions


PHILIP J. DAVIS 
NATIONAL BUREAU OF STANDARDS 


13.1 Introduction 


It has become traditional in works on numerical methods to express
errors committed in the use of approximate formulas of integration,
differentiation, interpolation, etc., in terms of the higher derivatives of the
function operated upon. The error consists of two parts: (1) a multiplicative
constant which depends solely on the numerical rule employed and
(2) a higher derivative $f^{(n)}(\xi)$ evaluated at some intermediate point of the
interval $I$. In some cases, the error is given by an expression of the form
$c_1 f^{(n)}(\xi) + c_2 f^{(m)}(\eta)$. In most cases, the exact value of $\xi$ will be unknown,
and the estimate $\max_{x \in I} |f^{(n)}(x)|$ is employed instead. Such
expressions of error are valid for the class of real functions which are
sufficiently differentiable and therefore possess wide applicability. On
the other hand, they have several drawbacks. In the first place, since
the error term of different rules may contain different orders of derivatives,
the respective accuracy of the rules may not be compared on a
common basis. Secondly, data on the higher derivatives may be unavailable
or may be difficult to obtain. Such will be the case, for
example, in operating with functions which are given in closed form but
which are highly composite, such as $f(x) = [1 + (1 + e^x)^{1/2}]^{1/2}$.

In the present discussion we derive a complex-variable method for
estimating errors which arise when approximation rules are applied to
analytic functions. In contrast to the usual real-variable methods, this
method does not involve the use of the higher derivatives of the function
but requires instead a knowledge of the size of the function in the complex
plane. It is therefore of practical value in dealing with functions of the
type described above. That such estimates are, in principle, possible for
analytic functions can be seen from Cauchy's inequality

$$|f^{(n)}(t)| \le \text{const} \cdot \max_{z \in C} |f(z)|, \qquad (13.1)$$

which gives us information about the size of the derivatives of an analytic
function in terms of its boundary values.

We derive our estimates of error by introducing a Hilbert space of
analytic functions and using the Riesz representation of bounded linear
functionals. The error $E(f)$ committed in the use of formulas of numerical
approximation applied to $f$ may be estimated in the form $|E(f)| \le \sigma_R \|f\|$.
This estimate is a Schwarz inequality. The quantity $\sigma_R$ is the
norm of the error functional; it depends solely on the approximation rule
employed and is independent of the particular function operated upon.
The quantity $\|f\|$ is the norm of $f$ in the Hilbert space and may be estimated
from a knowledge of the values of the function in the complex
plane. Having precomputed $\sigma_R$, we need only estimate $\|f\|$ in order to
obtain the error in any specific instance.


13.2 Essentials of the Method 


Let $B$ designate a bounded region lying in the plane of the complex
variable $z = x + iy$. By $L^2(B)$, we mean the class of functions which
are single-valued regular analytic in $B$ and are such that

$$\iint_B |f(z)|^2 \, dx\,dy < \infty. \qquad (13.2)$$

The class $L^2(B)$ has been studied extensively. For the fundamentals of
this theory, see Bergman [2]. We note in particular that, if $f(z)$ is regular
in the closure of $B$, then (13.2) holds. The positive square root of
the quantity (13.2) is termed the norm of $f$ over $L^2(B)$ and is designated
by $\|f\|$ or $\|f\|_B$. That is,

$$\|f\|_B = \Bigl[\iint_B |f(z)|^2 \, dx\,dy\Bigr]^{1/2}, \qquad f \in L^2(B). \qquad (13.3)$$

If a region $G$ satisfies $G \subset B$ and if $f \in L^2(B)$, it is clear that $f \in L^2(G)$
and that

$$\|f\|_G \le \|f\|_B. \qquad (13.4)$$

The principal features of the class $L^2(B)$ are as follows. We introduce
the integral

$$\iint_B f(z)\,\overline{g(z)} \, dx\,dy = (f, g), \qquad f, g \in L^2(B), \qquad (13.5)$$
as an inner product. The class $L^2(B)$ possesses complete orthonormal
systems $\{\phi_n(z)\}$ in the sense of (13.5). Every function of the class can be
expanded in a Fourier series

$$f(z) = \sum_{n=0}^{\infty} a_n \phi_n(z), \qquad (13.6)$$

wherein

$$a_n = (f, \phi_n) \qquad (n = 0, 1, \ldots). \qquad (13.7)$$

The convergence of (13.6) is uniform and absolute in every closed subregion
of $B$. Moreover, we have

$$\|f\|^2 = \sum_{n=0}^{\infty} |a_n|^2. \qquad (13.8)$$

The bilinear series $\sum_{n=0}^{\infty} \phi_n(z)\,\overline{\phi_n(w)}$ converges for $z, w \in B$ to a function
$K_B(z,w)$, which is called the Bergman kernel function of $B$ and which
possesses the characteristic reproducing property $(f(z), K_B(z,w)) = f(w)$
for all $f \in L^2(B)$. For simply connected regions $B$ with a Jordan boundary,
there exists a complete orthonormal set of polynomials $\{p_n(z)\}$.
Let $E$ be a bounded linear functional over $L^2(B)$; its norm may be
obtained in the following way. Let $\phi_n(z)$, $n = 0, 1, \ldots$, be a complete
orthonormal system for $L^2(B)$. Then it may be shown that

$$\|E\|^2 = \sum_{n=0}^{\infty} |E(\phi_n)|^2. \qquad (13.9)$$

This may be expressed in the alternative but equivalent form

$$\|E\|^2 = E_z \overline{E_w}\, K_B(z,w), \qquad (13.10)$$

where the subscripts on the $E$ mean that the functional operation is to be
carried out on the variable indicated. We have, then, for all $f \in L^2(B)$,

$$|E(f)| \le \|E\|\,\|f\|, \qquad (13.11)$$

with the equality sign attained for some $f \in L^2(B)$.
If $z_0$ designates a fixed point in the interior of $B$ and if $I$ designates a
closed interval lying in the interior of $B$, then the expressions $f^{(n)}(z_0)$,
$n = 0, 1, \ldots$, and $\int_I f(z)\,dz$ represent bounded linear functionals over
$L^2(B)$. Consider next some typical error expressions which occur in
numerical analysis. For instance,

$$E(f) = \int_I f(z)\,dz - \sum_{k=1}^{n} a_k f(z_k)$$

is a quadrature error,

$$E(f) = f'(z_0) - \sum_{k=1}^{n} a_k f(z_k)$$

is an error in a differentiation formula, and

$$E(f) = f(z_0) - \sum_{k=1}^{n} a_k f(z_k)$$

is an error in an interpolation or extrapolation. In view of the above
remark, and subject to those restrictions on the location of $z_k$ and $I$, the
errors $E$ that arise are bounded linear functionals over $L^2(B)$. We are
therefore in a position to apply the inequality (13.11).

We should like next to develop some practical error estimates for
analytic functions. Let us assume that we are dealing with the fixed
interval $[-1,+1]$. If $f(z)$ is analytic in the closed interval, then it also
is analytic in a certain region $D$ of the $z$ plane which contains $[-1,+1]$
in its interior. If $B$ designates a closed region with $[-1,+1] \subset B \subset D$,
then $f(z) \in L^2(B)$. Of course, if we want to deal with the family of all
functions which are analytic on $[-1,+1]$, we cannot fix a $B$ beforehand
but must have available a family of $B$'s which collapse to $[-1,+1]$.

Now, there are two cases in which the orthonormal polynomials over
$B$ have a simple structure. They are the circle and the ellipse, and in
view of the preceding remark, we reject the former in favor of the latter.

To discuss the case of the ellipse, it is best to assume that it has been
placed in a normalized position. Let an ellipse, therefore, have its
major axis along the $x$ axis and its foci at the points $(-1,0)$ and $(1,0)$.
Let $a$ and $b = (a^2 - 1)^{1/2}$ designate its semimajor and semiminor axes,
respectively, and let the quantity $\rho = \rho(a)$ be defined by

$$\rho = (a + b)^2, \qquad a = \tfrac{1}{2}(\rho^{1/2} + \rho^{-1/2}), \qquad b = \tfrac{1}{2}(\rho^{1/2} - \rho^{-1/2}). \qquad (13.12)$$

This ellipse will be designated by $\mathcal{E}_\rho$. For values of $\rho > 1$, these ellipses
form a confocal family and collapse to the segment $[-1,+1]$ as $\rho \to 1$.
Table 13.1 gives the values of the geometric quantities $b$ and $\rho$ for a
number of selected values of $a$.


TABLE 13.1 GEOMETRIC QUANTITIES FOR ELLIPSES


  a        b         ρ      (πab)^{1/2}
1.01    .1418    1.3266     .6707
1.02    .2010    1.4908     .8026
1.03    .2468    1.6302     .8936
1.04    .2857    1.7574     .9661
1.05    .3202    1.8773    1.0277
1.10    .4583    2.4282    1.2584
1.15    .5679    2.9511    1.4324
1.20    .6633    3.4720    1.5814
1.25    .7500    4.0000    1.7162
1.30    .8307    4.5397    1.8420
1.40    .9798    5.6634    2.0759
1.50   1.1180    6.8541    2.2953
1.75   1.4361   10.1515    2.8099
2.00   1.7321   13.9282    3.2989
2.50   2.2913   22.9565    4.2421
3.00   2.8284   33.9706    5.1631
4.00   3.8730   61.9839    6.9763
5.00   4.8990   97.9898    8.7723
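The entries of Table 13.1 follow directly from $b = (a^2 - 1)^{1/2}$ and $\rho = (a + b)^2$, so any row can be regenerated or extended. The short sketch below does this; the function names are chosen here for illustration and are not part of the text.

```python
import math

def ellipse_quantities(a):
    """Return (b, rho, sqrt(pi*a*b)) for the ellipse with foci at (-1, 0)
    and (1, 0) and semimajor axis a > 1."""
    if a <= 1.0:
        raise ValueError("the semimajor axis must exceed 1")
    b = math.sqrt(a * a - 1.0)
    rho = (a + b) ** 2
    return b, rho, math.sqrt(math.pi * a * b)

def semiaxes_from_rho(rho):
    """Invert the relation: recover (a, b) from rho > 1."""
    r = math.sqrt(rho)
    return 0.5 * (r + 1.0 / r), 0.5 * (r - 1.0 / r)

if __name__ == "__main__":
    for a in (1.01, 1.10, 1.25, 2.00, 5.00):
        b, rho, root_area = ellipse_quantities(a)
        print(f"{a:5.2f} {b:7.4f} {rho:8.4f} {root_area:7.4f}")
```

The last column, $(\pi ab)^{1/2}$, is the square root of the area of the ellipse.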

We introduce the Chebyshev polynomials of the second kind by means
of the definition

$$U_n(z) = (1 - z^2)^{-1/2} \sin[(n + 1) \arccos z] \qquad (n = 0, 1, \ldots). \qquad (13.13)$$

It can then be shown that the polynomials

$$p_n(z) = 2(n + 1)^{1/2}\,\pi^{-1/2}\,(\rho^{n+1} - \rho^{-n-1})^{-1/2}\,U_n(z) \qquad (n = 0, 1, \ldots) \qquad (13.14)$$

form a complete orthonormal system for $L^2(\mathcal{E}_\rho)$.


13.3 Quadrature Errors for $L^2(\mathcal{E}_\rho)$


The program outlined in the previous section has been carried out to
a considerable extent for quadrature errors.

In conformity with the above normalization, we assume that the integration
to be performed is over the interval $[-1,1]$. The case of an
arbitrary interval may be handled by means of an appropriate linear
transformation (see Sec. 13.6). An arbitrary $(N + 1)$-point quadrature
formula is given by

$$\int_{-1}^{1} f(x)\,dx \approx \sum_{k=0}^{N} a_k f(\lambda_k) = R(f), \qquad (13.15)$$

where $\lambda_k$ are certain abscissas lying in $[-1,1]$ and $a_k$ are the associated
weights. The error $E = E(f)$ involved in the rule $R$ is

$$E(f) = \int_{-1}^{1} f(x)\,dx - R(f). \qquad (13.16)$$

If $f(x)$ is analytic on $[-1,1]$, then it is clear that for some value of
$\rho > 1$, $f$ may be continued analytically so as to be regular in the closed
ellipse $\mathcal{E}_\rho$. Such an $f(z)$ is therefore of class $L^2(\mathcal{E}_\rho)$ and may be expanded
in a series of Chebyshev polynomials

$$f(z) = \sum_{n=0}^{\infty} a_n p_n(z), \qquad \sum_{n=0}^{\infty} |a_n|^2 = \|f\|^2 < \infty, \qquad (13.17)$$
which converges uniformly and absolutely in the interior of $\mathcal{E}_\rho$. Applying
the operator $E$ to (13.17), we obtain

$$E(f) = \sum_{n=0}^{\infty} a_n E(p_n). \qquad (13.18)$$

Let us now write

$$\sigma_R^2 = \sum_{n=0}^{\infty} |E(p_n)|^2; \qquad (13.19)$$

then, in view of (13.8) and (13.9), there is obtained

$$|E(f)| \le \sigma_R\,\|f\|. \qquad (13.20)$$

The quantity $\sigma_R$, which is the norm over $L^2(\mathcal{E}_\rho)$ of the bounded linear
functional $E$, depends only on the ellipse $\mathcal{E}_\rho$ and the quadrature rule $R$,
but is independent of $f$, and may therefore be computed once for all.
Using (13.19) and (13.14), we have

$$\sigma_R^2 = \frac{4}{\pi} \sum_{n=0}^{\infty} \frac{n + 1}{\rho^{n+1} - \rho^{-n-1}}\,|E(U_n)|^2. \qquad (13.21)$$

We have, moreover,

$$U_n(z) = (n + 1)^{-1}\,T'_{n+1}(z) \qquad (n = 0, 1, \ldots), \qquad (13.22)$$

where $T_n(z)$ designates the Chebyshev polynomials of the first kind
defined by

$$T_n(z) = \cos(n \arccos z) \qquad (n = 0, 1, \ldots). \qquad (13.23)$$

Therefore,

$$\int_{-1}^{1} U_n(x)\,dx = \frac{1}{n + 1}\,[T_{n+1}(1) - T_{n+1}(-1)] = \frac{1}{n + 1}\,[1 + (-1)^n]. \qquad (13.24)$$

If quantities $\tau_n$ are defined by

$$\tau_n = \begin{cases} 0, & n \text{ odd}, \\ \dfrac{2}{n + 1}, & n \text{ even}, \end{cases} \qquad (13.25)$$

then we have

$$\sigma_R^2 = \frac{4}{\pi} \sum_{n=0}^{\infty} \frac{n + 1}{\rho^{n+1} - \rho^{-n-1}}\,\Bigl|\tau_n - \sum_{k=0}^{N} a_k U_n(\lambda_k)\Bigr|^2. \qquad (13.26)$$

Table 13.2 lists the values of $\sigma_R$ corresponding to the trapezoidal, the
Simpson, the Weddle, and the Gaussian 2-, 3-, 7-, 10-, and 16-point
formulas and for a range of values of the parameter $\rho$. These values
were computed from (13.26) on the National Bureau of Standards Eastern
Automatic Computer (SEAC).
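The series for $\sigma_R$ can be summed directly, using the recurrence $U_{n+1}(x) = 2xU_n(x) - U_{n-1}(x)$ for the Chebyshev polynomials and the values $\int_{-1}^{1} U_n\,dx = 2/(n+1)$ for even $n$ and $0$ for odd $n$. The sketch below is one possible arrangement, not the program that produced the table: the function name, the stopping rule (the sum is cut off once $\rho^{n+1}$ exceeds $10^{250}$, where the remaining terms are far below rounding level), and the three sample rules are choices made here for illustration.

```python
import math

def sigma_r(nodes, weights, rho):
    """Norm of the quadrature error functional over the ellipse with
    parameter rho, for the rule sum_k weights[k] * f(nodes[k]) on [-1, 1]."""
    if rho <= 1.0:
        raise ValueError("rho must exceed 1")
    u_prev = [0.0] * len(nodes)   # U_{-1}(x) = 0 starts the recurrence
    u_curr = [1.0] * len(nodes)   # U_0(x) = 1
    total = 0.0
    power = rho                   # holds rho**(n+1)
    n = 0
    while power < 1e250:
        tau = 2.0 / (n + 1) if n % 2 == 0 else 0.0
        err = tau - sum(w * u for w, u in zip(weights, u_curr))  # E(U_n)
        total += (n + 1) / (power - 1.0 / power) * err * err
        # U_{n+1}(x) = 2 x U_n(x) - U_{n-1}(x) at every abscissa
        u_prev, u_curr = u_curr, [2.0 * x * c - p
                                  for x, c, p in zip(nodes, u_curr, u_prev)]
        power *= rho
        n += 1
    return math.sqrt(4.0 / math.pi * total)

TRAPEZOIDAL = ([-1.0, 1.0], [1.0, 1.0])
SIMPSON = ([-1.0, 0.0, 1.0], [1.0 / 3.0, 4.0 / 3.0, 1.0 / 3.0])
GAUSS_2 = ([-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)], [1.0, 1.0])

if __name__ == "__main__":
    for name, rule in (("trapezoidal", TRAPEZOIDAL), ("Simpson", SIMPSON),
                       ("Gauss 2-point", GAUSS_2)):
        print(name, sigma_r(rule[0], rule[1], 4.0))
```

A rule that integrates polynomials of degree $m$ exactly has $E(U_n) = 0$ for $n \le m$, so its series starts later and its $\sigma_R$ falls off faster as $\rho$ grows.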
TABLE 13.2 [values of $\sigma_R$ for the trapezoidal, Simpson, Weddle, and Gaussian rules as functions of $\rho$; the entries are illegible in this copy]
13.4 On the Error Estimate (13.20)


The inequality (13.20) allows us to estimate quadrature errors from a 
table of values of o and form an estimate of the norm || f ||. Let us 
assume that we are dealing with a fixed rule R witherror £. The value 
of the parameter p is at our disposal to a certain extent. For functions 
f(x) which are analytic on [—1,1], there will be a range of values of p, 
l<p<p* < o,forwhich/f(z) €L?(é,). For each such p, the inequal- 
ity (13.20) is valid. To exhibit the dependence of (13.20) on p, we 


should write 
IE(f)| <or(p) IF lle,s 1 <p < p*. (13.20") 


Now || fly, = 0 when p = | and increases as p increases. On the other 
hand, o,,(p) decreases as pincreases. Hence, the best estimate will occur 
for some intermediate value of p. The inequality (13.20’) can therefore 
be improved by writing 


ELA < min on(e) Ife, (13.27) 


Let the rule $R$, as well as the ellipse $\mathcal{E}_\rho$, be fixed. Then the inequality
(13.20') is a best possible one in the following sense: there exist functions
$g$ of class $L^2(\mathcal{E}_\rho)$ such that

$$|E(g)| = \sigma_R(\rho)\, \|g\|_{\mathcal{E}_\rho}. \tag{13.28}$$

In general, of course, the error will be less. There are two additional
facts which, in the practical carrying out of this method, tend to diminish
its precision. The first is that the exact minimum in
(13.27) is difficult to ascertain, so that what one does is simply to take
the minimum of a finite number of such values. The second is that the
norm of $f$ cannot be computed exactly except in the simplest cases, so
that an appropriate upper bound must be used. The net effect of all
this is to replace the right-hand side of (13.27) with a less precise upper
bound, but one which is much more readily ascertained.


13.5 Methods for Estimating $\|f\|$


It appears, then, that the principal task confronting the numerical
analyst when using the present method is that of estimating the norm
$\|f\|$ over some ellipse $\mathcal{E}_\rho$. In this section we explain a number of devices
which may be useful for this purpose. By the very definition (13.3), we
have

$$\|f\|_{\mathcal{E}_\rho}^2 = \iint_{\mathcal{E}_\rho} |f(z)|^2 \, dx \, dy, \tag{13.29}$$

and it may be possible in certain simple cases to evaluate (13.29) directly.
If $f(z)$ is continuous in the closed ellipse $\mathcal{E}_\rho$ and if we set

$$M_\rho = \max_{z \in \mathcal{E}_\rho} |f(z)|, \tag{13.30}$$

then we have, from (13.29),

$$\|f\|_{\mathcal{E}_\rho}^2 \le M_\rho^2 \iint_{\mathcal{E}_\rho} dx \, dy, \tag{13.31}$$

so that

$$\|f\|_{\mathcal{E}_\rho} \le \sqrt{\pi a b}\, M_\rho, \tag{13.32}$$

where the quantities $a$ and $b$ are related to $\rho$ by means of (13.12). The
quantity $M_\rho$ or an upper bound for it can, in many cases, be obtained by
algebraic manipulation. It may be more convenient to replace $\mathcal{E}_\rho$ by
some circle containing it, say the circle $C_\rho$:

$$|z| \le a = \frac{\rho + 1}{2\sqrt{\rho}}.$$

We then have

$$\|f\|_{\mathcal{E}_\rho} \le \|f\|_{C_\rho} \le a\sqrt{\pi}\, \max_{|z| \le a} |f(z)|, \tag{13.33}$$

assuming that $f(z)$ is regular in $C_\rho$.

If $f(z)$ is of class $L^2(C_\rho)$ and its Taylor expansion is known:

$$f(z) = \sum_{n=0}^{\infty} \alpha_n z^n, \tag{13.34}$$

then it can be shown that we have

$$\|f\|_{C_\rho}^2 = \pi \sum_{n=0}^{\infty} \frac{|\alpha_n|^2 a^{2n+2}}{n+1}, \tag{13.35}$$

which is an exact evaluation of the middle term of the inequality (13.33).
The quantity $\|f\|_{\mathcal{E}_\rho}$ may itself be expressed as an infinite series involving
the Taylor coefficients $\alpha_n$, but such expressions are more cumbersome
and consequently less useful.

It should be pointed out that in cases of great complexity, there is
always the possibility of obtaining $M_\rho$ directly by evaluating $|f(z)|$ along
the boundary of $\mathcal{E}_\rho$ on a sufficiently dense set of points. Despite the loss
of accuracy involved in combining (13.20) and (13.32), error bounds obtained
in this way will, in general, be better than those obtained by using the
conventional real-variable error expressions and estimating the
derivatives which occur there by means of Cauchy's inequality. This can be
explained by the fact that the real-variable expressions must be valid for
a wider class of functions.
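The boundary-sampling device just described can be sketched as follows. The example assumes an ellipse with foci at $\pm 1$, semimajor axis $a$, and semiminor axis $b = \sqrt{a^2 - 1}$; the helper names and the number of sample points are illustrative. A sampled maximum can slightly underestimate the true $M_\rho$, so in practice a small safety margin or a finer grid would be used.

```python
import cmath
import math

def ellipse_boundary(a, n_points=720):
    """Points a*cos(t) + i*b*sin(t) on the ellipse with foci +-1."""
    b = math.sqrt(a * a - 1.0)
    return [complex(a * math.cos(2.0 * math.pi * j / n_points),
                    b * math.sin(2.0 * math.pi * j / n_points))
            for j in range(n_points)]

def norm_bound(f, a, n_points=720):
    """Return (M, sqrt(pi*a*b) * M) with M the largest sampled |f(z)|.

    For analytic f the maximum over the closed ellipse lies on the
    boundary, so sampling the boundary is sufficient; compare (13.32).
    """
    b = math.sqrt(a * a - 1.0)
    m = max(abs(f(z)) for z in ellipse_boundary(a, n_points))
    return m, math.sqrt(math.pi * a * b) * m

# The integrand of Example 1 in Sec. 13.7, transferred to [-1, 1].
g = lambda z: cmath.exp(cmath.exp((z + 1) / 2))
m, bound = norm_bound(g, 2.5)
print(m, bound)
```

For this $g$ the maximum is attained at the real point $z = a$, so the sampled value agrees with $\exp[e^{(a+1)/2}]$.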
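13.6 The Case of an Arbitrary Interval


The quantities of Tables 13.1 and 13.2 refer to the interval $[-1,1]$.
The case of an arbitrary interval $[a,b]$ is dealt with by means of the linear
transformation

$$x = \frac{2w}{b-a} - \frac{b+a}{b-a}, \qquad w = \frac{b-a}{2}\,x + \frac{b+a}{2}, \tag{13.36}$$

which carries the interval $a \le w \le b$ into $-1 \le x \le 1$. Let the rule
$E^*$ be given on $[a,b]$ as follows:

$$E^*(f) = \int_a^b f(w)\,dw - \sum_{k=0}^{N} a_k f(\lambda_k). \tag{13.37}$$

The analogous rule on $[-1,1]$ is given by

$$E(f) = \int_{-1}^{1} f(x)\,dx - \sum_{k=0}^{N} \frac{2a_k}{b-a}\, f\!\left(\frac{2\lambda_k}{b-a} - \frac{b+a}{b-a}\right) \tag{13.38}$$

and is known by the same name. If $f(w)$ is analytic on $[a,b]$, then

$$g(x) = f\!\left(\frac{b-a}{2}\,x + \frac{b+a}{2}\right) \tag{13.39}$$

is analytic on $[-1,1]$, and setting

$$x_k = \frac{2\lambda_k}{b-a} - \frac{b+a}{b-a},$$

we have

$$E^*(f) = \int_a^b f(w)\,dw - \sum_{k=0}^{N} a_k f(\lambda_k) = \frac{b-a}{2}\left[\int_{-1}^{1} g(x)\,dx - \sum_{k=0}^{N} \frac{2a_k}{b-a}\, g(x_k)\right], \tag{13.40}$$

so that from (13.40), (13.39), and (13.38),

$$E^*(f) = \frac{b-a}{2}\,E(g). \tag{13.41}$$

Thus,

$$|E^*(f)| = \frac{b-a}{2}\,|E(g)| \le \frac{b-a}{2}\,\sigma_R(\rho)\,\|g\|_{\mathcal{E}_\rho}. \tag{13.42}$$

The $\sigma_R(\rho)$ are the tabulated values (Table 13.2) in the $z = x + iy$ plane,
and $\|g\|_{\mathcal{E}_\rho}$ also refers to this plane.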
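The scaling relation (13.41) can be checked numerically. The sketch below uses Simpson's rule and the test integrand $f(w) = w^4$ on $[0,3]$; both choices, and the helper name `simpson_error`, are illustrative assumptions rather than anything taken from the text.

```python
def simpson_error(f, exact, lo, hi):
    """Error E = exact integral - Simpson's rule on [lo, hi]."""
    h = (hi - lo) / 2.0
    rule = h / 3.0 * (f(lo) + 4.0 * f(lo + h) + f(hi))
    return exact - rule

a, b = 0.0, 3.0
c, d = (b - a) / 2.0, (b + a) / 2.0

f = lambda w: w ** 4
g = lambda x: f(c * x + d)            # the transferred integrand (13.39)

int_f = (b ** 5 - a ** 5) / 5.0       # exact integral of w^4 over [a, b]
# Exact integral of g over [-1, 1], from the antiderivative (c*x + d)^5 / (5c).
int_g = ((c + d) ** 5 - (-c + d) ** 5) / (5.0 * c)

e_star = simpson_error(f, int_f, a, b)    # E*(f) on [a, b]
e_ref = simpson_error(g, int_g, -1.0, 1.0)  # E(g) on [-1, 1]
print(e_star, c * e_ref)                  # the two numbers should agree
```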
13.7 Examples


1. Estimate the error $E$ incurred in evaluating $\int_0^1 \exp(e^x)\,dx$ by Weddle's
rule. From (13.39) we have

$$g(z) = \exp\left[e^{(z+1)/2}\right],$$

which is an entire function of $z$ and is therefore of class $L^2(\mathcal{E}_\rho)$ for all
$\rho > 1$. Now,

$$\left|\exp\left[e^{(z+1)/2}\right]\right| = \exp\left\{\operatorname{Re}\left[e^{(z+1)/2}\right]\right\} = \exp\left[e^{(x+1)/2} \cos (y/2)\right].$$

Thus on $\mathcal{E}_\rho$ we have

$$\left|\exp\left[e^{(z+1)/2}\right]\right| \le \exp\left[e^{(a+1)/2}\right].$$

By (13.42) and (13.32) we have

$$|E| \le \tfrac{1}{2} (\pi a b)^{1/2} \exp\left[e^{(a+1)/2}\right] \sigma_W \qquad (W = \text{Weddle}). \tag{13.43}$$

According to (13.20'), the estimate (13.43) is valid for all $a > 1$. Of the
values tabulated, the right-hand member of (13.43) is minimized for
$a = 2.5$ and yields

$$|E| \le \tfrac{1}{2}(4.242)(315.7)(2.836 \times 10^{-6}) = .0019. \tag{13.44}$$
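The arithmetic of (13.44) can be reproduced directly. In this sketch the value $\sigma_W = 2.836 \times 10^{-6}$ is simply the figure quoted in (13.44) for $a = 2.5$; the function name is hypothetical, and $b = \sqrt{a^2 - 1}$ is assumed for the semiminor axis.

```python
import math

def weddle_bound(a, sigma_w):
    """Right-hand member of (13.43) for semimajor axis a."""
    b = math.sqrt(a * a - 1.0)                 # assumed semiminor axis
    area_factor = math.sqrt(math.pi * a * b)   # (pi*a*b)^(1/2)
    m = math.exp(math.exp((a + 1.0) / 2.0))    # bound for |g| on the ellipse
    return 0.5 * area_factor * m * sigma_w

SIGMA_W = 2.836e-6   # value quoted in (13.44) for a = 2.5
print(weddle_bound(2.5, SIGMA_W))   # about 0.0019
```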


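In the above work, the norm of $\exp[e^{(z+1)/2}]$ has been estimated crudely,
and a number of improvements suggest themselves. Thus, for instance,

$$\left|\exp\left[e^{(z+1)/2}\right]\right| \le \exp\left[e^{(x+1)/2}\right],$$

and since this last function is concave upward, we have

$$\exp\left[e^{(x+1)/2}\right] \le \frac{x}{2a}\left\{\exp\left[e^{(a+1)/2}\right] - \exp\left[e^{(-a+1)/2}\right]\right\} + \frac{1}{2}\left\{\exp\left[e^{(a+1)/2}\right] + \exp\left[e^{(-a+1)/2}\right]\right\}, \qquad -a \le x \le a. \tag{13.45}$$

It is now easily verified that for arbitrary $A$, $B$, $C$,

$$\iint_{\mathcal{E}_\rho} (Ax + By + C)^2 \, dx \, dy = \pi a b \left(\frac{A^2 a^2}{4} + \frac{B^2 b^2}{4} + C^2\right), \tag{13.46}$$

so that we have

$$|E| \le \tfrac{1}{2} (\pi a b)^{1/2} \left(\frac{A^2 a^2}{4} + C^2\right)^{1/2} \sigma_W, \tag{13.47}$$

where

$$A = \frac{1}{2a}\left\{\exp\left[e^{(a+1)/2}\right] - \exp\left[e^{(-a+1)/2}\right]\right\}, \qquad C = \frac{1}{2}\left\{\exp\left[e^{(a+1)/2}\right] + \exp\left[e^{(-a+1)/2}\right]\right\}. \tag{13.48}$$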
For $a = 2.5$, we have $A = 62.82$, $C = 158.65$, so that

$$|E| \le \tfrac{1}{2}(4.2421)(160.6)(2.836 \times 10^{-6}) = .00097. \tag{13.49}$$

A conventional estimate of the error, from the formula
EL < max |) + SMD], fe) = ex (C), 
O< F<1 


yielded the value E < .006. 


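2. Estimate the error $E$ incurred in evaluating $\int_3^4 \Gamma(w)\,dw$ by means of
the 7-point Gaussian rule. This is a case in which conventional methods
cannot be used, owing to the lack of information about the higher
derivatives of the integrand. Transferring to the interval $[-1,1]$, we must
consider the function $g(z) = \Gamma[\tfrac{1}{2}(z + 7)]$. This function is regular in
$|z| < 7$; hence we may take $a$ in the range $1 < a < 7$. Now,

$$|\Gamma(x + iy)| \le \Gamma(x) \qquad \text{for } x > 0,$$

so that

$$|g(z)| = \left|\Gamma\left[\tfrac{1}{2}(x + 7) + \tfrac{1}{2} i y\right]\right| \le \Gamma\left[\tfrac{1}{2}(x + 7)\right].$$

By the convexity of the $\Gamma$ function,

$$|g(z)| \le \max\left\{\Gamma\left[\tfrac{1}{2}(a + 7)\right],\ \Gamma\left[\tfrac{1}{2}(-a + 7)\right]\right\}, \qquad z \in \mathcal{E}_\rho.$$

Thus we have, from (13.42) and (13.32),

$$|E| \le \tfrac{1}{2} (\pi a b)^{1/2} \max\left\{\Gamma\left[\tfrac{1}{2}(a + 7)\right],\ \Gamma\left[\tfrac{1}{2}(-a + 7)\right]\right\} \sigma_{G_7} \qquad (G_7 = \text{Gauss 7-point}). \tag{13.50}$$

The selection $a = 5.0$ yields

$$|E| \le \tfrac{1}{2}(8.772)(120)(3.867 \times 10^{-15}) = 2.04 \times 10^{-12}. \tag{13.51}$$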
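As with the first example, the numbers in (13.51) can be checked in a few lines. The value $\sigma_{G_7} = 3.867 \times 10^{-15}$ is the figure quoted in (13.51) for $a = 5.0$; the function name is hypothetical, and $b = \sqrt{a^2 - 1}$ is assumed.

```python
import math

def gauss7_gamma_bound(a, sigma_g7):
    """Right-hand member of (13.50); requires 1 < a < 7."""
    if not 1.0 < a < 7.0:
        raise ValueError("a must lie in (1, 7) so that g is regular")
    b = math.sqrt(a * a - 1.0)                      # assumed semiminor axis
    g_max = max(math.gamma((a + 7.0) / 2.0),
                math.gamma((-a + 7.0) / 2.0))       # bound for |g| on the ellipse
    return 0.5 * math.sqrt(math.pi * a * b) * g_max * sigma_g7

SIGMA_G7 = 3.867e-15   # value quoted in (13.51) for a = 5.0
print(gauss7_gamma_bound(5.0, SIGMA_G7))   # about 2.04e-12
```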


For a further example of the utility of the method, the reader is 
referred to Henrici [6]. 


13.8 Quadrature Schemes


One theoretical question which can be treated by the methods
outlined in Secs. 13.2 and 13.3 is that of the convergence of a scheme of
quadratures of the form

$$\int_{-1}^{1} f(x)\,dx \approx \sum_{k=0}^{n} a_{nk} f(\lambda_{nk}) = Q_n(f) \tag{13.52}$$

when applied to certain distinguished classes of analytic functions on
$[-1,+1]$. The question of the convergence of $Q_n(f)$ to the integral in
(13.52) has been solved completely by Pólya [12] for the case in which $f$
is selected from the class of continuous functions. There seems to be
less discussion of the problem when $f$ is selected from the class of analytic
functions on $[-1,+1]$ or from certain of its subclasses.

We say that the quadrature scheme (13.52) converges uniformly in
$L^2(B)$ if, having been given an $\epsilon > 0$, there is an $n_0 = n_0(\epsilon)$ such that,
for all $f \in L^2(B)$ and $n \ge n_0$, we have

$$\left|\int_{-1}^{1} f(x)\,dx - \sum_{k=0}^{n} a_{nk} f(\lambda_{nk})\right| \le \epsilon\, \|f\|. \tag{13.53}$$
Theorem 13.1. A condition necessary and sufficient in order that the
quadrature scheme (13.52) converge uniformly in $L^2(B)$ is that

$$\lim_{n \to \infty} \|E_n\|^2 = \lim_{n \to \infty} E_n \overline{E_n}\, K_B(z, \bar{w}) = 0. \tag{13.54}$$

Proof. Suppose that (13.54) holds. Then, given an $\epsilon > 0$, we can
find an $n_0(\epsilon)$ such that $\|E_n\| \le \epsilon$ for all $n \ge n_0(\epsilon)$. Hence, by (13.11),
the inequality (13.53) must hold. Conversely, suppose that (13.53)
holds. For each $n$, it is possible to find a nontrivial function $f_n(z) \in L^2(B)$
such that

$$|E_n(f_n)| = \|E_n\|\, \|f_n\|. \tag{13.55}$$

By (13.53), given an $\epsilon > 0$, we may find an $n_0 = n_0(\epsilon)$ such that for all
$n \ge n_0(\epsilon)$ and for all $f \in L^2(B)$ we have $|E_n(f)| \le \epsilon \|f\|$. Hence, in
particular, for the $f_n$ of (13.55),

$$\|E_n\|\, \|f_n\| = |E_n(f_n)| \le \epsilon\, \|f_n\|. \tag{13.56}$$

Therefore (13.54) must follow.

We note that, in view of (13.9), the condition (13.54) can, in principle,
be converted into a necessary and sufficient condition on the weights $a_{nk}$
and abscissas $\lambda_{nk}$.

Corollary. From (13.21), a condition necessary and sufficient in order that
the quadrature scheme (13.52) converge uniformly in $L^2(\mathcal{E}_\rho)$ is that

$$\lim_{n \to \infty} \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{(k+1)\, |E_n(U_k)|^2}{\rho^{k+1} - \rho^{-k-1}} = 0. \tag{13.57}$$


13.9 Interpolatory Quadrature


An important class of quadrature schemes is formed by those which
are of interpolatory type. For such quadratures we have

$$Q_n(f) = \int_{-1}^{1} f(x)\,dx, \tag{13.58}$$

whenever $f$ is a polynomial of degree not larger than $n$. If the scheme is
of interpolatory type, then (13.57) becomes

$$\lim_{n \to \infty} \frac{4}{\pi} \sum_{k=n+1}^{\infty} \frac{(k+1)\, |E_n(U_k)|^2}{\rho^{k+1} - \rho^{-k-1}} = 0. \tag{13.59}$$
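In view of the inequalities

$$\rho^{-1}\rho^{-k} < \left(\rho^{k+1} - \rho^{-k-1}\right)^{-1} \le \left(\rho - \rho^{-1}\right)^{-1} \rho^{-k} \qquad (\rho > 1), \tag{13.60}$$

condition (13.59) is equivalent to

$$\lim_{n \to \infty} \sum_{k=n+1}^{\infty} (k+1)\, |E_n(U_k)|^2\, \rho^{-k} = 0. \tag{13.61}$$

With $\tau_k$ given by (13.25), (13.61) becomes

$$\lim_{n \to \infty} \sum_{k=n+1}^{\infty} (k+1) \left| \tau_k - \sum_{j=0}^{n} a_{nj} U_k(\lambda_{nj}) \right|^2 \rho^{-k} = 0. \tag{13.62}$$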


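The following sufficient condition for the uniform convergence in
$L^2(\mathcal{E}_\rho)$ of an interpolatory quadrature scheme can now be obtained. Set

$$M_n = \sum_{j=0}^{n} |a_{nj}|, \tag{13.63}$$

and observe that for real abscissas $\lambda$ in $[-1,+1]$ we have

$$|U_k(\lambda)| \le k + 1. \tag{13.64}$$

Then, using (13.25) and (13.63), for fixed $\rho > 1$ we get

$$\begin{aligned}
\sum_{k=n+1}^{\infty} (k+1) \left| \tau_k - \sum_{j=0}^{n} a_{nj} U_k(\lambda_{nj}) \right|^2 \rho^{-k}
&\le \sum_{k=n+1}^{\infty} (k+1) \left[\tau_k + (k+1) M_n\right]^2 \rho^{-k} \\
&\le 4 \sum_{k=n+1}^{\infty} (k+1)^{-1} \rho^{-k} + 4 M_n \sum_{k=n+1}^{\infty} (k+1) \rho^{-k} + M_n^2 \sum_{k=n+1}^{\infty} (k+1)^3 \rho^{-k} \\
&\le o(1) + C_1 M_n n \rho^{-n} + C_2 M_n^2 n^3 \rho^{-n} \qquad (n \to \infty),
\end{aligned} \tag{13.65}$$

where $C_1$ and $C_2$ are two positive constants which may depend on $\rho$ but
are independent of $n$. Thus, we have the following result: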


Theorem 13.2. Let

$$\lim_{n \to \infty} M_n n^{3/2} \rho^{-n/2} = 0. \tag{13.66}$$

Then an interpolatory quadrature scheme converges uniformly in $L^2(\mathcal{E}_\rho)$.

Pólya ([12], p. 285) has remarked that, if

$$\lim_{n \to \infty} (M_n)^{1/n} = 1, \tag{13.67}$$

then an interpolatory quadrature scheme converges for all functions
which are analytic in the closed basic interval. Under hypothesis
(13.67), we have

$$M_n = (1 + \epsilon_n)^n, \qquad \epsilon_n \to 0,$$

so that (13.66) holds with all $\rho > 1$. Thus, under Pólya's hypothesis,
we see that the convergence is also uniform in every $L^2(\mathcal{E}_\rho)$, $\rho > 1$.


13.10 Newton-Cotes Quadrature


We turn now to a specific quadrature scheme on $[-1,+1]$, namely,
the Newton-Cotes scheme. In this scheme, we have

$$Q_n(f) = a_{n0} f(-1) + a_{n1} f(-1 + 2/n) + a_{n2} f(-1 + 4/n) + \cdots + a_{nn} f(1) \qquad (n = 1, 2, \ldots), \tag{13.68}$$

where the Cotes numbers $a_{nk}$ have been determined so that

$$Q_n(f) = \int_{-1}^{1} f(x)\,dx$$

holds for an arbitrary polynomial of degree $\le n$. We have now the
following estimate due to Ouspensky [11] (Ouspensky's basic interval is
$[0,1]$):
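
$$a_{nk} = (-1)^{k-1}\, \frac{2}{n (\log n)^2} \binom{n}{k} \left[\frac{1}{k} + \frac{(-1)^n}{n-k}\right] (1 + \eta_{nk}), \tag{13.69}$$

where $\eta_{nk} \to 0$ as $n \to \infty$ uniformly for $k = 1, 2, \ldots, n-1$ and

$$a_{n0} = a_{nn} = \frac{2}{n \log n}\,(1 + \epsilon_n), \qquad \epsilon_n \to 0. \tag{13.70}$$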
Thus,

$$M_n = \sum_{k=0}^{n} |a_{nk}| \le \frac{4(1 + \delta_n)}{n (\log n)^2} \sum_{k=1}^{n-1} \binom{n}{k} + \frac{4}{n \log n}\,(1 + \epsilon_n), \tag{13.71}$$

where we have written $|\eta_{nk}| \le \delta_n$ $(k = 1, 2, \ldots, n-1)$, $\delta_n \to 0$. Hence,

$$M_n \le \frac{4(1 + \delta_n)\, 2^n}{n (\log n)^2} + \frac{4(1 + \epsilon_n)}{n \log n}. \tag{13.72}$$

Condition (13.66) now holds with $\rho^{1/2} > 2$. We have therefore arrived
at the following result:

Theorem 13.3. The Newton-Cotes quadrature scheme converges uniformly
in $L^2(\mathcal{E}_\rho)$ whenever $\rho > 4$.
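The growth of $M_n = \sum_k |a_{nk}|$ that drives the condition $\rho > 4$ can be observed directly. The sketch below computes the Cotes numbers on $[-1,1]$ in exact rational arithmetic by integrating the Lagrange basis polynomials; this is an illustrative construction, not Ouspensky's asymptotic formula, and the function names are hypothetical.

```python
from fractions import Fraction

def cotes_numbers(n):
    """Exact weights a_{n0}, ..., a_{nn} of the closed Newton-Cotes rule on [-1, 1]."""
    nodes = [Fraction(-1) + Fraction(2 * j, n) for j in range(n + 1)]
    weights = []
    for j, xj in enumerate(nodes):
        # Coefficients (lowest degree first) of the Lagrange basis polynomial L_j.
        coeffs = [Fraction(1)]
        for i, xi in enumerate(nodes):
            if i == j:
                continue
            denom = xj - xi
            new = [Fraction(0)] * (len(coeffs) + 1)
            for d, c in enumerate(coeffs):   # multiply by (x - xi) / denom
                new[d] -= c * xi / denom
                new[d + 1] += c / denom
            coeffs = new
        # Integrate over [-1, 1]: odd powers vanish, x^d contributes 2/(d+1).
        weights.append(sum(c * Fraction(2, d + 1)
                           for d, c in enumerate(coeffs) if d % 2 == 0))
    return weights

def cotes_abs_sum(n):
    """M_n of (13.63) for the Newton-Cotes scheme."""
    return sum(abs(w) for w in cotes_numbers(n))

for n in (2, 8, 16, 24):
    print(n, float(cotes_abs_sum(n)))
```

While all weights are positive, $M_n = 2$; once negative weights appear (from $n = 8$ on), $M_n$ grows rapidly, in line with the factor $2^n$ in (13.72).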

Investigation of the convergence of the Newton-Cotes quadrature
scheme has an interesting history which is worth retelling here. Stieltjes,
in 1884, first proved the convergence of the Gauss mechanical quadrature
for the class of Riemann integrable functions and, in a letter to Hermite,
raised the question of the convergence of the Newton-Cotes scheme. In
1925 Ouspensky [11] arrived at the asymptotic result (13.69) and, from
the growth of Cotes numbers, concluded only that the Newton-Cotes
scheme is devoid of practical value. In 1933 Pólya [12] showed that
this scheme is not valid for all continuous functions and, indeed, is not
valid for the class of analytic functions. Pólya's counterexample,
referred to the interval $[-1,+1]$, is


$$f(w) = -\sum_{k=4}^{\infty} a^{k!}\, \frac{\sin\{k!\, \pi [(w+1)/2]\}}{\cos\{\pi [(w+1)/2]\}}, \qquad \tfrac{1}{2} < a < 1, \tag{13.73}$$


for which the Newton-Cotes scheme diverges. The functions $f(w)$ are
regular in the strip

$$|\operatorname{Im}(w)| < \frac{-2 \log a}{\pi} \tag{13.74}$$

and have a natural boundary along the sides of the strip. The widest
such strip must be less than

$$|\operatorname{Im}(w)| < \frac{2 \log 2}{\pi} = .4412.$$

The function (13.73) cannot, therefore, be continued analytically to
$\mathcal{E}_4$, for which the semiminor axis is $b = .7500$. Therefore, Theorem
13.3 rehabilitates the Newton-Cotes quadrature scheme for functions
which are regular over a sufficiently large portion of the complex plane.
BASIC CONCEPTS


In the first part of the chapter, Secs. 14.1 to 14.3, we state the
fundamental concepts which we require later. For all details and further
information we refer to the books on functional analysis (e.g., [2, 9, 11, 12,
14]) listed in the bibliography following Sec. 14.3.


14.1 Metric Spaces


A set $\mathfrak{X}$ is called a metric space if for every two elements $x$, $y$ of $\mathfrak{X}$ there is
defined a real-valued function $\rho(x,y)$, called the distance between $x$ and $y$,
which has these properties:

(i) $\rho(x,y) = 0$ if and only if $x = y$;
(ii) $\rho(x,y) \le \rho(z,x) + \rho(z,y)$ for any three elements $x$, $y$, $z$.

It follows that $\rho(x,y) \ge 0$ and $\rho(x,y) = \rho(y,x)$; hence the triangle axiom
(ii) becomes $\rho(x,y) \le \rho(x,z) + \rho(z,y)$.

In a metric space $\mathfrak{X}$ the set of points $x \in \mathfrak{X}$ with $\rho(x,x_0) < r$ is called the
(open) sphere $S(x_0,r)$ of center $x_0$ $(\in \mathfrak{X})$ and radius $r$ $(> 0)$. A set $M \subset \mathfrak{X}$ is
said to be open if every $x \in M$ is the center of some (open) sphere
contained in $M$. This definition of open sets introduces a topology in $\mathfrak{X}$ under
which $\mathfrak{X}$ becomes a normal Hausdorff space.

A sequence $\{x_n\}$, $x_n \in \mathfrak{X}$, converges to a point $x \in \mathfrak{X}$ if $\rho(x_n,x) \to 0$
as $n \to \infty$; we write $x_n \to x$ as $n \to \infty$. A sequence $\{x_n\}$, $x_n \in \mathfrak{X}$, is called a
Cauchy sequence if $\rho(x_n,x_m) \to 0$ as $n, m \to \infty$. Clearly, if $x_n \to x$ as
$n \to \infty$, then $\{x_n\}$ is a Cauchy sequence, but a Cauchy sequence $\{x_n\}$
need not converge to a point $x \in \mathfrak{X}$.

A metric space $\mathfrak{X}$ is said to be complete if every Cauchy sequence in $\mathfrak{X}$
converges to a point of $\mathfrak{X}$.


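Example. The field of real (or complex) numbers considered as a metric
space with $\rho(x,y) = |x - y|$ is a complete metric space.


A metric space $\mathfrak{X}$ is separable if there exists a countable set which is
dense in $\mathfrak{X}$, that is, whose (topological) closure is $\mathfrak{X}$.

Let $\mathfrak{X}$, $\mathfrak{Y}$ be two metric spaces, distinct or not, with metrics $\rho_x$, $\rho_y$,
respectively. A single-valued function $y = T(x)$ with domain $\mathfrak{D} \subset \mathfrak{X}$
and range $\mathfrak{R} \subset \mathfrak{Y}$ is called an operator (or mapping) from $\mathfrak{D}$ onto $\mathfrak{R}$ or,
simply, an operator from $\mathfrak{D}$ into $\mathfrak{Y}$; that is, to every $x \in \mathfrak{D}$ there
corresponds a unique $y \in \mathfrak{R}$, and every $y \in \mathfrak{R}$ is the image of at least one $x \in \mathfrak{D}$.
If $\mathfrak{S} \subset \mathfrak{R}$, then the set of points $x \in \mathfrak{D}$ for which $T(x) \in \mathfrak{S}$ is called the
inverse image of $\mathfrak{S}$ and is denoted by $T^{-1}(\mathfrak{S})$. The operator is said to be
one-to-one if $\mathfrak{S} = T(\mathfrak{U})$ implies $\mathfrak{U} = T^{-1}(\mathfrak{S})$ for every $\mathfrak{U} \subset \mathfrak{D}$.

If $y = T(x)$ is a one-to-one operator from $\mathfrak{D} \subset \mathfrak{X}$ onto $\mathfrak{R} \subset \mathfrak{Y}$, there
exists a unique single-valued function $x = T^{-1}(y)$, called the inverse
operator to $y = T(x)$, with domain $\mathfrak{R}$ $(\subset \mathfrak{Y})$ and range $\mathfrak{D}$ $(\subset \mathfrak{X})$ such that
$T[T^{-1}(y)] = y$ for every $y \in \mathfrak{R}$, and $T^{-1}[T(x)] = x$ for every $x \in \mathfrak{D}$.

The operator $y = T(x)$ from $\mathfrak{D} \subset \mathfrak{X}$ into $\mathfrak{Y}$ is continuous at $x_0 \in \mathfrak{D}$ if to
every $\epsilon > 0$ there corresponds a $\delta(\epsilon) > 0$ such that $\rho_y[T(x), T(x_0)] < \epsilon$
for all $x \in \mathfrak{D}$ with $\rho_x(x,x_0) < \delta$ or, equivalently, if $x_n \to x_0$ implies
$T(x_n) \to T(x_0)$. If $y = T(x)$ is continuous at every $x \in \mathfrak{D}$, it is said to be
continuous on $\mathfrak{D}$.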

14.2 Linear Spaces


A set $\mathfrak{X}$ with at least two distinct elements is a linear (vector) space (or
linear system or module) over the field of complex (real) numbers if its
elements admit of two operations, called addition and scalar
multiplication, such that

1. $\mathfrak{X}$ is an abelian group with respect to addition.
2. Scalar multiplication has these properties:
a. To every complex (real) number $\alpha$ and every $x \in \mathfrak{X}$ there
corresponds a unique $\alpha x = x\alpha \in \mathfrak{X}$.
b. $(\alpha + \beta)x = \alpha x + \beta x$, $(\alpha\beta)x = \alpha(\beta x)$ for every complex (real)
$\alpha$, $\beta$ and every $x \in \mathfrak{X}$.
c. $\alpha(x + y) = \alpha x + \alpha y$ for every complex (real) $\alpha$ and every $x, y \in \mathfrak{X}$.
d. $1 \cdot x = x$ for every $x \in \mathfrak{X}$.


The points of a linear system are usually called vectors.
A set of $n$ vectors $x_i \in \mathfrak{X}$ is said to be linearly independent if $\sum_{i=1}^{n} \alpha_i x_i = 0$
implies $\alpha_i = 0$ for all $i$; otherwise, the $n$ vectors are said to be linearly
dependent. If $\mathfrak{X}$ contains a linearly independent set of $n$ vectors and
every system of $(n + 1)$ vectors is linearly dependent, then $\mathfrak{X}$ has
dimension $n$. If $n$ is not finite, $\mathfrak{X}$ is said to be of infinite dimension.

A linear subspace $\mathfrak{M}$ of a linear space $\mathfrak{X}$ is a set $\mathfrak{M} \subset \mathfrak{X}$ such that, if $x$, $y$
belong to $\mathfrak{M}$, then $\alpha x + \beta y \in \mathfrak{M}$ for every complex (real) $\alpha$, $\beta$.

A linear space $\mathfrak{X}$ is called a normed linear space if to every $x \in \mathfrak{X}$ there
corresponds a real number $\|x\|$, the norm of $x$, with these properties:

(i) $\|x\| \ge 0$ and $\|x\| = 0$ if and only if $x = 0$,
(ii) $\|\alpha x\| = |\alpha|\, \|x\|$,
(iii) $\|x + y\| \le \|x\| + \|y\|$.

If the distance between $x$ and $y$ is defined by $\rho(x,y) = \|x - y\|$, then $\mathfrak{X}$
becomes a metric space.

A complex (real) linear space $\mathfrak{X}$ is called an inner-product space if for
every two vectors $x$, $y$ of $\mathfrak{X}$ there is defined a complex-valued (real-valued)
function $(x,y)$, called the inner product of $x$ and $y$, such that

(i) $(x,y) = \overline{(y,x)}$,
(ii) $(\alpha x + \beta y, z) = \alpha(x,z) + \beta(y,z)$ for every $x$, $y$, $z$ of $\mathfrak{X}$ and
every complex (real) $\alpha$, $\beta$,
(iii) $(x,x) \ge 0$ and $(x,x) = 0$ if and only if $x = 0$.

If $\mathfrak{X}$ is an inner-product space, then $\|x\| = +\sqrt{(x,x)}$ is a norm for $\mathfrak{X}$;
this follows from the Schwarz inequality $|(x,y)|^2 \le (x,x)(y,y)$.

A normed linear space $\mathfrak{X}$ which is complete is called a Banach space.
An inner-product space $\mathfrak{X}$ which is complete under the norm
$\|x\| = +\sqrt{(x,x)}$ is called a Hilbert space.

Example. The $n$-dimensional euclidean space is a Hilbert space. The
set $l_p$ of all complex sequences $\{\alpha_n\}$ such that $\sum_n |\alpha_n|^p < \infty$ for some $p \in [1,\infty)$
is a Banach space under the norm

$$\|\{\alpha_n\}\| = \left[\sum_n |\alpha_n|^p\right]^{1/p}.$$

Let $\mathfrak{X}$, $\mathfrak{Y}$ be linear spaces. An operator $y = A(x)$ from $\mathfrak{D} \subset \mathfrak{X}$ into
$\mathfrak{Y}$ is said to be linear if $\mathfrak{D}$ is a linear subspace of $\mathfrak{X}$ and if
$A(\alpha x + \beta y) = \alpha A(x) + \beta A(y)$ for every $x$, $y$ of $\mathfrak{D}$ and every complex (real) $\alpha$, $\beta$.
It follows that the range $\mathfrak{R}$ of $A$ is a linear subspace of $\mathfrak{Y}$. If $\mathfrak{Y}$ is the
field of complex (real) numbers, a linear operator $y = A(x)$ from a
linear space $\mathfrak{X}$ into $\mathfrak{Y}$ is often called a linear functional.
A linear operator $A$ from a normed linear space $\mathfrak{X}$ into a normed
linear space $\mathfrak{Y}$ is bounded if, for some constant $M$, $\|A(x)\| \le M \|x\|$ for all $x \in \mathfrak{X}$. The
number $\|A\| = \sup \{\|Ax\| : \|x\| \le 1\}$ is defined as the norm of $A$.
A linear operator from a normed linear space $\mathfrak{X}$ into a normed linear
space $\mathfrak{Y}$ is continuous on $\mathfrak{X}$ if and only if it is bounded.


Example. Let $E_m$ be the linear space of real $m$-dimensional column
vectors $x$. A linear operator $y = A(x)$ from $E_m$ into itself is defined by a system
of linear equations

$$\eta_i = \sum_{k=1}^{m} a_{ik} \xi_k, \qquad i = 1, \ldots, m,$$

and hence is known once the $m \times m$ matrix $(a_{ik})$ is given. If $E_m$ is normed
by the $l_1$ norm

$$\|x\| = \sum_{i=1}^{m} |\xi_i|, \quad \text{then} \quad \|A\| = \max_k \sum_{i=1}^{m} |a_{ik}|;$$

if $E_m$ is normed by the $l_\infty$ norm

$$\|x\| = \max_i |\xi_i|, \quad \text{then} \quad \|A\| = \max_i \sum_{k=1}^{m} |a_{ik}|.$$

If $E_m$ is given the usual euclidean $l_2$ norm

$$\|x\| = \left(\sum_{i=1}^{m} |\xi_i|^2\right)^{1/2}, \quad \text{then} \quad \|A\| \le \left(\sum_{i,k=1}^{m} |a_{ik}|^2\right)^{1/2}.$$
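The three matrix-norm statements of this example are easy to check on a small matrix. The sketch below uses a hypothetical $2 \times 2$ matrix and plain Python lists; the function names are illustrative.

```python
import math

def op_norm_l1(A):
    """Largest absolute column sum: operator norm induced by the l_1 norm."""
    return max(sum(abs(row[k]) for row in A) for k in range(len(A[0])))

def op_norm_linf(A):
    """Largest absolute row sum: operator norm induced by the l_infinity norm."""
    return max(sum(abs(v) for v in row) for row in A)

def frobenius(A):
    """Square root of the sum of squares: an upper bound for the l_2 operator norm."""
    return math.sqrt(sum(v * v for row in A for v in row))

def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

A = [[1.0, -2.0], [3.0, 4.0]]
print(op_norm_l1(A), op_norm_linf(A), frobenius(A))

# The l_1 norm is attained at a unit coordinate vector,
# the l_infinity norm at a vector of suitably signed ones.
print(sum(abs(v) for v in matvec(A, [0.0, 1.0])))
print(max(abs(v) for v in matvec(A, [1.0, 1.0])))
```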


Let $\mathfrak{X}$ be a (real) Banach space and $\mathfrak{X}^*$ the set of bounded linear functionals
from $\mathfrak{X}$ to the real field $E_1$. Under the ordinary definition of addition
and real scalar multiplication, $\mathfrak{X}^*$ becomes a linear space. In fact, if the
norm of $x^* \in \mathfrak{X}^*$ is defined by $\|x^*\| = \sup \{|x^*(x)| : \|x\| \le 1\}$, $\mathfrak{X}^*$ can be
shown to be a (real) Banach space, called the adjoint (or conjugate) space
of $\mathfrak{X}$.

Let $\mathfrak{X}$ be a Hilbert space and $A$ a linear operator from a set $\mathfrak{D}$ dense in $\mathfrak{X}$
into $\mathfrak{X}$. The (Hilbert space) adjoint $A^*$ of $A$ is defined as follows: the
domain of $A^*$ is the set of all $y^* \in \mathfrak{X}$ for which there exists an $x^* \in \mathfrak{X}$ such
that

$$(Ax, y^*) = (x, x^*) \qquad \text{for all } x \in \mathfrak{D}.$$

The vector $x^*$, which is unique, is then the image of $y^*$ under the adjoint
operator $A^*$; that is, $x^* = A^*(y^*)$. Clearly, $A^*$ is a linear operator from
$\mathfrak{X}$ into itself.

If $\mathfrak{X} = E_m$ and $A$ is the linear operator given by the matrix $(a_{ik})$, the
adjoint $A^*$ is defined by the transposed matrix $(a_{ki})$. Observe that
$(Ax, y) = (x, A^*y)$ for all $x$, $y$ of $E_m$.

A linear operator $A$ defined on a domain $\mathfrak{D}$ dense in a Hilbert space
is called self-adjoint if the domain of its adjoint $A^*$ is $\mathfrak{D}$ and $A(x) = A^*(x)$
for every $x \in \mathfrak{D}$. As in matrix theory, a self-adjoint operator $A$ is positive
if $(Ax,x) > 0$ for every $x \in \mathfrak{D}$, $x \ne 0$, and positive definite if
$(Ax,x) \ge \mu (x,x)$ for every $x$, where $\mu > 0$.
A linear operator $A$ with domain $\mathfrak{D}$ in a normed linear space $\mathfrak{X}$ and
range in a normed linear space $\mathfrak{Y}$ has an inverse $A^{-1}$ provided $A(x) = 0$
if and only if $x = 0$. Clearly $A^{-1}$ is a linear operator. Moreover, $A^{-1}$
exists and is bounded if and only if a constant $\mu > 0$ can be found such
that $\|A(x)\| \ge \mu \|x\|$ for every $x \in \mathfrak{D}$; the supremum of admissible $\mu$ is
$\|A^{-1}\|^{-1}$.

Let $\mathfrak{X}$ be a Banach space and $A$ a linear operator from $\mathfrak{X}$ into itself. If
$\|A\| < 1$, then $(I - A)^{-1}$ exists ($I$ being the identity operator) and has
domain $\mathfrak{X}$, and $\|(I - A)^{-1}\| \le (1 - \|A\|)^{-1}$.
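The last statement can be illustrated in $E_2$ with the row-sum ($l_\infty$) norm. The sketch below sums the series $I + A + A^2 + \cdots$, which is one standard way to exhibit $(I - A)^{-1}$ when $\|A\| < 1$; the matrix, the number of terms, and the helper names are illustrative assumptions.

```python
def mat_mul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_inverse(A, terms=200):
    """Partial sum I + A + ... + A^terms, approximating (I - A)^(-1) when ||A|| < 1."""
    n = len(A)
    total = identity(n)
    power = identity(n)
    for _ in range(terms):
        power = mat_mul(power, A)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total

def row_sum_norm(A):
    return max(sum(abs(v) for v in row) for row in A)

A = [[0.2, 0.3], [0.1, 0.4]]          # row-sum norm 0.5 < 1
inv = neumann_inverse(A)
I_minus_A = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(2)] for i in range(2)]

print(mat_mul(I_minus_A, inv))        # close to the identity matrix
print(row_sum_norm(inv), 1.0 / (1.0 - row_sum_norm(A)))   # norm versus the bound
```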


14.3 Gateaux and Fréchet Differentials


Let $\mathfrak{X}$, $\mathfrak{Y}$ be real normed linear spaces and $y = T(x)$ a (linear or
nonlinear) operator with domain $\mathfrak{D} \subset \mathfrak{X}$ and range $\mathfrak{R} \subset \mathfrak{Y}$; to avoid
trivialities, it is assumed that $\mathfrak{D}$ has a nonempty interior. Let $x_0$ be an interior
point of $\mathfrak{D}$ and let $h \in \mathfrak{X}$ be arbitrary. Then $x_0 + th \in \mathfrak{D}$ for small enough
(real) $t$, say $|t| < \epsilon(x_0,h)$. If the limit

$$\delta T(x_0,h) = \lim_{t \to 0} \frac{1}{t} \left[T(x_0 + th) - T(x_0)\right]$$

exists, it is called the Gateaux differential of $T$ at $x_0$ with increment $h$. If
$T$ has a Gateaux differential at $x_0$ for every increment $h \in \mathfrak{X}$, then $T$ is said
to be Gateaux-differentiable or, simply, G-differentiable at $x_0$; and if $T$ is
G-differentiable at every $x_0$ of an open set $\mathfrak{D}_0 \subset \mathfrak{D}$, it is called
G-differentiable on $\mathfrak{D}_0$.


Example. Let $\mathfrak{X}$ be the ordinary euclidean $(\xi_1,\xi_2)$ plane and $\mathfrak{Y}$ the real
line. Then $\eta = F(\xi_1,\xi_2)$ is G-differentiable at a point $(\xi_1,\xi_2)$ of an open set
$\mathfrak{D} \subset \mathfrak{X}$ if

$$\delta F(x,h) = \lim_{t \to 0} \frac{1}{t} \left[F(\xi_1 + t h_1, \xi_2 + t h_2) - F(\xi_1,\xi_2)\right]$$

exists for every $h = (h_1,h_2) \in \mathfrak{X}$. This is equivalent to requiring that the
directional derivative of $F(\xi_1,\xi_2)$ exists at $(\xi_1,\xi_2)$ for every direction vector
$h \in \mathfrak{X}$.


If $\delta T(x_0,h)$ exists, so does $\delta T(x_0,\alpha h)$ for any (real) $\alpha$ and

$$\delta T(x_0,\alpha h) = \alpha\, \delta T(x_0,h).$$

If $T_1$, $T_2$ have a G differential at $x_0$ with increment $h$, so has
$T = \alpha T_1 + \beta T_2$ for any real $\alpha$, $\beta$ and

$$\delta T(x_0,h) = \alpha\, \delta T_1(x_0,h) + \beta\, \delta T_2(x_0,h).$$

The G differential $\delta T(x_0,h)$ is not necessarily linear in $h$, and even if it is,
it need not be continuous in $h$. If $T$ is G-differentiable on an open set
$\mathfrak{D}_0 \subset \mathfrak{D}$ such that $\delta T(x,h)$ is continuous in $x$ at some $x_0 \in \mathfrak{D}_0$ for every
fixed $h \in \mathfrak{X}$, then $\delta T(x_0,h)$ is linear in $h$. In this case, $\delta T(x_0,h) = A(h)$,
where $A$ is a linear operator from $\mathfrak{X}$ into $\mathfrak{Y}$. At times $A(h)$ is called the G
derivative of $T$ at $x_0$.
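That a G differential can be homogeneous without being linear in $h$ is seen in a standard example, which is not taken from the text: $F(\xi_1,\xi_2) = \xi_1^2 \xi_2 / (\xi_1^2 + \xi_2^2)$ with $F(0,0) = 0$. The sketch below approximates $\delta F(0,h)$ by a difference quotient; the step size `t` and the function names are illustrative.

```python
def gateaux(F, x, h, t=1e-6):
    """Difference quotient [F(x + t h) - F(x)] / t approximating delta F(x, h)."""
    shifted = [xi + t * hi for xi, hi in zip(x, h)]
    return (F(shifted) - F(x)) / t

def F(p):
    """F(xi1, xi2) = xi1^2 * xi2 / (xi1^2 + xi2^2), with F(0, 0) = 0."""
    x, y = p
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * x * y / (x * x + y * y)

origin = [0.0, 0.0]
d_e1 = gateaux(F, origin, [1.0, 0.0])     # differential along (1, 0)
d_e2 = gateaux(F, origin, [0.0, 1.0])     # differential along (0, 1)
d_sum = gateaux(F, origin, [1.0, 1.0])    # differential along (1, 1)
d_scaled = gateaux(F, origin, [3.0, 3.0]) # homogeneity: three times d_sum

print(d_e1, d_e2, d_sum, d_scaled)
```

Here $\delta F(0,h) = h_1^2 h_2 / (h_1^2 + h_2^2)$, so the differential along $(1,1)$ is not the sum of the differentials along $(1,0)$ and $(0,1)$, although the homogeneity relation $\delta F(0,\alpha h) = \alpha\, \delta F(0,h)$ holds.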

$T$ is said to be Fréchet-differentiable or, simply, F-differentiable at an
interior point $x_0 \in \mathfrak{D}$ if it has a G differential $\delta T(x_0,h)$ which is linear and
continuous in $h$ and if

$$\lim_{\|h\| \to 0} \frac{1}{\|h\|}\, \|T(x_0 + h) - T(x_0) - \delta T(x_0,h)\| = 0.$$

$\delta T(x_0,h)$ is then called the Fréchet differential of $T$ at $x_0$ with
increment $h$. If $T$ is F-differentiable at every point of an open set $\mathfrak{D}_0 \subset \mathfrak{D}$, it
is called F-differentiable on $\mathfrak{D}_0$.

The correspondence between $h \in \mathfrak{X}$ and the F differential $\delta T(x_0,h)$ is by
definition a bounded linear operator $A_0$ depending on $x_0 \in \mathfrak{D}$. The
correspondence between $x_0 \in \mathfrak{D}$ and $A_0$ is an operator from $\mathfrak{D}$ into the
normed linear space $\mathfrak{L}(\mathfrak{X},\mathfrak{Y})$ of bounded linear operators from $\mathfrak{X}$ to $\mathfrak{Y}$;
this operator is called the Fréchet or, simply, the F derivative $T'$ of $T$.

If $T$ is F-differentiable at $x_0 \in \mathfrak{D}$, then $T$ is continuous at $x_0$. If $T_1$, $T_2$
are F-differentiable at $x_0 \in \mathfrak{D}$, so is $T = \alpha T_1 + \beta T_2$ for any real $\alpha$, $\beta$, and

$$T' = \alpha T_1' + \beta T_2'.$$

If $T$ is F-differentiable on an open convex set $\mathfrak{D}_0 \subset \mathfrak{X}$, then

$$\|T(x + h) - T(x)\| \le \|h\| \sup \{\|T'(x + th)\| : 0 < t < 1\}$$

so long as $x \in \mathfrak{D}_0$ and $x + h \in \mathfrak{D}_0$.

A linear operator $A$ is F-differentiable only if it is defined and bounded
on $\mathfrak{X}$. If $A'$ exists, $A'(x) = A$, so that $A'$ is defined on $\mathfrak{X}$ also.

If an operator $T$ on $\mathfrak{D}$ to $\mathfrak{Y}$ is F-differentiable on $\mathfrak{D}_0 \subset \mathfrak{D}$, then $T'$ may
be F-differentiable on some set $\mathfrak{D}_1 \subset \mathfrak{D}_0$. In this case the F derivative of
$T'$ is called the second F derivative of $T$ and denoted by $T''$; it is an
operator from $\mathfrak{D}_1$ into the normed linear space of bounded linear operators
from $\mathfrak{X}$ into $\mathfrak{L}(\mathfrak{X},\mathfrak{Y})$.

If $T$ is twice F-differentiable on an open convex set $\mathfrak{D}_1 \subset \mathfrak{X}$, then

$$\|T(x + h) - T(x) - T'(x)h\| \le \tfrac{1}{2} \|h\|^2 \sup \{\|T''(x + th)\| : 0 < t < 1\}$$

so long as $x \in \mathfrak{D}_1$ and $x + h \in \mathfrak{D}_1$.
SUCCESSIVE APPROXIMATIONS


We prove here the central fixed-point theorem on successive
approximation and show that many of the iteration procedures found in nearly
every branch of applied mathematics are merely applications of this
theorem. Our presentation is intended to point out the power of the
basically simple theorem and to indicate at the same time its limitations.
We do not aim at the most general treatment possible.


14.4 The General Theorem


Let $\mathfrak{X}$ be a complete metric space with metric $\rho$ and $T$ an operator with
domain $\mathfrak{D}$ and range in $\mathfrak{X}$ such that for all $x$, $y$ in $\mathfrak{D}$

$$\rho[T(x), T(y)] \le \alpha \rho(x,y)$$

where $0 < \alpha < 1$. Such an operator is called a contracting operator
(or mapping); it is obviously continuous on $\mathfrak{D}$.

Theorem 14.1. Let $x_0 \in \mathfrak{D}$ be selected such that the closed sphere $S(x_0,r)$ of
center $x_0$ and radius $r \ge (1 - \alpha)^{-1} \rho[x_0, T(x_0)]$ is contained in $\mathfrak{D}$ and define

$$x_{n+1} = T(x_n), \qquad n = 0, 1, 2, \ldots.$$

Then $\{x_n\}$ converges to a point $x^* \in S(x_0,r)$, which is the unique solution [in
$S(x_0,r)$] of the equation $x = T(x)$.

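Proof. We show first that $x_n \in S(x_0,r)$ for all integers $n \ge 0$. This is
clear for $n = 0$; since, for every integer $k \ge 1$,

$$\rho(x_{k+1},x_k) = \rho[T(x_k), T(x_{k-1})] \le \alpha \rho(x_k,x_{k-1}) \le \cdots \le \alpha^k \rho(x_1,x_0) \le \alpha^k (1 - \alpha) r,$$

it follows by induction that

$$\rho(x_{n+1},x_0) \le \sum_{k=0}^{n} \rho(x_{k+1},x_k) \le (1 - \alpha) r \sum_{k=0}^{n} \alpha^k = (1 - \alpha^{n+1}) r \le r.$$

Further, for any integer $p \ge 1$,

$$\rho(x_n,x_{n+p}) \le \sum_{k=1}^{p} \rho(x_{n+k-1},x_{n+k}) \le \alpha^n (1 - \alpha^p)(1 - \alpha)^{-1} \rho[x_0, T(x_0)] \le \alpha^n (1 - \alpha^p) r,$$

which shows that $\{x_n\}$ is a Cauchy sequence. Since $\mathfrak{X}$ is complete,
$x_n \to x^*$ as $n \to \infty$ and, evidently, $x^* \in S(x_0,r)$. Moreover, for any $n \ge 1$,

$$\begin{aligned}
\rho[x^*, T(x^*)] &\le \rho(x^*,x_n) + \rho[x_n, T(x^*)] \\
&= \rho(x^*,x_n) + \rho[T(x_{n-1}), T(x^*)] \\
&\le \rho(x^*,x_n) + \alpha \rho(x^*,x_{n-1}),
\end{aligned}$$

so that, as $n \to \infty$, $\rho[x^*, T(x^*)] = 0$, which implies $x^* = T(x^*)$. If
$y \ne x^*$ were another solution of $x = T(x)$, then $\rho(x^*,y) = \rho[T(x^*), T(y)] \le \alpha \rho(x^*,y)$,
whence $\rho(x^*,y) = 0$, which is absurd.

Remark. A bound for the error at the $n$th iteration is

$$\rho(x_n,x^*) \le \alpha (1 - \alpha)^{-1} \rho(x_{n-1},x_n),$$

which follows from

$$\rho(x_n,x_{n+p}) \le \sum_{k=1}^{p} \rho(x_{n+k-1},x_{n+k}) \le \alpha (1 - \alpha^p)(1 - \alpha)^{-1} \rho(x_{n-1},x_n)$$

on letting $p \to \infty$.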
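The error bound of the Remark gives a practical stopping rule. The sketch below applies it to the illustrative contraction $T(x) = \tfrac{1}{2} \cos x$ on the real line, for which $\alpha = \tfrac{1}{2}$; the function name, tolerance, and test problem are assumptions made for the example.

```python
import math

def fixed_point(T, x0, alpha, tol=1e-12, max_iter=1000):
    """Iterate x_{n+1} = T(x_n) until alpha/(1-alpha) * |x_n - x_{n-1}| <= tol.

    Returns (x_n, n, bound), where bound is the a posteriori error estimate
    rho(x_n, x*) <= alpha/(1-alpha) * rho(x_{n-1}, x_n) from the Remark.
    """
    x_prev = x0
    for n in range(1, max_iter + 1):
        x = T(x_prev)
        bound = alpha / (1.0 - alpha) * abs(x - x_prev)
        if bound <= tol:
            return x, n, bound
        x_prev = x
    raise RuntimeError("no convergence within max_iter iterations")

T = lambda x: 0.5 * math.cos(x)      # |T'(x)| <= 1/2 everywhere
x_star, steps, bound = fixed_point(T, 0.0, alpha=0.5)
print(x_star, steps, bound)
```

Since the domain is the whole real line, every closed sphere about $x_0$ lies in it, so the hypothesis of Theorem 14.1 on $x_0$ is satisfied automatically in this example.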


14.5 Solution of Systems of Equations


Let $\mathfrak{X}$ be the linear space $E_m$ of real $m$-dimensional column vectors $x$
metricized by one of the following norms:

(i) $l_1$ norm: $\|x\| = \sum_{k=1}^{m} |\xi_k|$,
(ii) $l_2$ norm: $\|x\| = \left(\sum_{k=1}^{m} |\xi_k|^2\right)^{1/2}$,
(iii) $l_\infty$ norm: $\|x\| = \max_k |\xi_k|$.

Under each norm, $E_m$ is complete. Let

$$x = T(x)$$

be given, where $T$ is a vector function defined on a set $\mathfrak{D} \subset E_m$ whose
components $\tau_k(\xi_1, \ldots, \xi_m)$ satisfy on $\mathfrak{D}$ Lipschitz conditions

$$|\tau_k(\xi_1, \ldots, \xi_m) - \tau_k(\eta_1, \ldots, \eta_m)| \le \sum_{j=1}^{m} \mu_{kj} |\xi_j - \eta_j| \qquad (k = 1, \ldots, m).$$

To solve this equation on $\mathfrak{D}$ by iteration in "total steps"

$$x_{n+1} = T(x_n)$$

starting from an $x_0$ chosen according to Theorem 14.1, we introduce the
"comparison" matrix $M = (\mu_{ij})$, $1 \le i, j \le m$, and the vector function
$\alpha(x)$ defined by $\alpha_i(x) = |\xi_i|$, $i = 1, \ldots, m$.

Interpreting $x \le y$ to mean, as usual, $\xi_k \le \eta_k$ for each $k = 1, 2, \ldots, m$,
we can then write the Lipschitz conditions in the form

$$\alpha[T(x) - T(y)] \le M \alpha(x - y),$$

whence, for each of the three norms indicated above,

$$\|T(x) - T(y)\| = \|\alpha[T(x) - T(y)]\| \le \|M \alpha(x - y)\| \le \|M\|\, \|\alpha(x - y)\| = \|M\|\, \|x - y\|,$$

since, obviously, $\|\alpha(x)\| = \|x\|$ always. Therefore,

$$\|M\| \le \mu < 1$$

is sufficient for the convergence of the successive approximations. Thus,
we obtain for the three norms three (sufficient) conditions for convergence
and these error estimates:

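(i) $l_1$ norm:

$$\|M\| = \mu_1 = \max_k \sum_{i=1}^{m} |\mu_{ik}| < 1,$$

$$\sum_{i=1}^{m} |\xi_i^{(n)} - \xi_i^*| \le \frac{\mu_1}{1 - \mu_1} \sum_{i=1}^{m} |\xi_i^{(n)} - \xi_i^{(n-1)}|;$$

(ii) $l_2$ norm:

$$\|M\| \le \mu_2 = \left(\sum_{i,k=1}^{m} |\mu_{ik}|^2\right)^{1/2} < 1,$$

$$\left(\sum_{i=1}^{m} |\xi_i^{(n)} - \xi_i^*|^2\right)^{1/2} \le \frac{\mu_2}{1 - \mu_2} \left(\sum_{i=1}^{m} |\xi_i^{(n)} - \xi_i^{(n-1)}|^2\right)^{1/2};$$

(iii) $l_\infty$ norm:

$$\|M\| = \mu_\infty = \max_i \sum_{k=1}^{m} |\mu_{ik}| < 1,$$

$$\max_i |\xi_i^{(n)} - \xi_i^*| \le \frac{\mu_\infty}{1 - \mu_\infty} \max_i |\xi_i^{(n)} - \xi_i^{(n-1)}|.$$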
To solve the equation $x = T(x)$ by iteration in "single steps"

$$\xi_k^{(n+1)} = \tau_k\left(\xi_1^{(n+1)}, \ldots, \xi_{k-1}^{(n+1)}, \xi_k^{(n)}, \ldots, \xi_m^{(n)}\right), \qquad k = 1, \ldots, m,$$

we define the vector function $S$ with components

$$\eta_k = \sigma_k(x) = \tau_k(\eta_1, \ldots, \eta_{k-1}, \xi_k, \ldots, \xi_m) \qquad (k = 1, 2, \ldots, m).$$

The single-step iteration formulas can then be written as

$$x_{n+1} = S(x_n), \qquad n = 0, 1, 2, \ldots.$$

We now split the comparison matrix $M = M' + M''$, where

$$M' = (\mu'_{ij}) \text{ with } \mu'_{ij} = \begin{cases} 0 & \text{for } i \le j, \\ \mu_{ij} & \text{for } i > j, \end{cases}
\qquad \text{and} \qquad
M'' = (\mu''_{ij}) \text{ with } \mu''_{ij} = \begin{cases} \mu_{ij} & \text{for } i \le j, \\ 0 & \text{for } i > j. \end{cases}$$

Then the Lipschitz condition assumes the form

$$\alpha[S(x) - S(y)] \le M' \alpha[S(x) - S(y)] + M'' \alpha(x - y).$$

Since $M'$ is lower triangular, the matrix $I - M'$ is clearly nonsingular,
and hence $N = (I - M')^{-1} M''$ is defined everywhere; moreover,
$Nx \ge 0$ for $x \ge 0$, since $\mu_{ij} \ge 0$ for $1 \le i, j \le m$. Thus,

$$\alpha[S(x) - S(y)] \le N \alpha(x - y),$$

whence for each of the three norms

$$\|S(x) - S(y)\| = \|\alpha[S(x) - S(y)]\| \le \|N \alpha(x - y)\| \le \|N\|\, \|\alpha(x - y)\| = \|N\|\, \|x - y\|.$$

Therefore, the condition

$$\nu = \|N\| < 1$$

suffices for the convergence of the iteration, provided the initial
approximation is selected according to Theorem 14.1.

If $\|M'\| < 1$, then $\|N\| \le \|M''\| (1 - \|M'\|)^{-1}$; if, moreover,
$\|M'\| + \|M''\| < 1$, then this inequality may be extended to
$\|N\| \le \|M''\| (1 - \|M'\|)^{-1} \le \|M'\| + \|M''\|$. These estimates are rough; for
details we refer to [8, 9]. For the $l_\infty$ norm of $E_m$ an exact evaluation of
$\|N\|$ was first given by Mehmke and Nekrasov [10]; see also Sassenfeld
[11], who proceeds as follows:

Let $e$ denote the vector in $E_m$ whose components are all 1, so that
$\|e\| = 1$. Since $Nx \ge 0$ for $x \ge 0$, it follows that $Ne \ge N\alpha(x) \ge \alpha(Nx)$
if $\|x\| \le 1$, whence

$$\|N\| = \|Ne\| = \max_j n_j,$$

where

$$n_1 = \sum_{k=1}^{m} \mu_{1k}, \qquad n_j = \sum_{k=1}^{j-1} \mu_{jk} n_k + \sum_{k=j}^{m} \mu_{jk} \quad \text{for } 2 \le j \le m - 1,$$

and

$$n_m = \sum_{k=1}^{m-1} \mu_{mk} n_k + \mu_{mm}.$$


14.6 Systems of Integral Equations


The previous results carry over without essential changes to an
$m$-dimensional system of Fredholm integral equations

$$x(t) = z(t) + \lambda \int_a^b \kappa[t,s,x(s)]\,ds$$

over a finite interval $[a,b]$. Here $\lambda$ is a parameter, and $\kappa$ is assumed to be
continuous in its variables on some set $\mathfrak{D}$ and to satisfy uniformly a
Lipschitz condition in $x$. We omit the details.

In the case of a Volterra equation of the second kind, 


PO ee [ ede ae 


the situation is somewhat different. For the sake of simplicity, let us
restrict ourselves to a single equation, with $x(t)$ a scalar function, and
assume that $\kappa(t,s,u)$ is continuous on the box

$$|t - a| \le \delta_0, \quad |s - a| \le \delta_0, \quad |u| \le \delta_1,$$

and satisfies there the Lipschitz condition

$$|\kappa(t,s,u) - \kappa(t,s,v)| \le \mu |u - v|$$

uniformly in $t$ and $s$. Let $\mathfrak{C}$ be the normed linear space of (real)
continuous functions $x(t)$ on the interval $[a - \delta_0, a + \delta_0]$ with norm
$\|x\| = \max |x(t)|$ and define

$$T(x) = z(t) + \int_a^t \kappa[t,s,x(s)]\,ds$$

on the sphere in $\mathfrak{C}$ with center at the origin and radius $\delta_1$. Thus, for
$t \ge a$,

$$|T(x) - T(y)| = \left| \int_a^t \{\kappa[t,s,x(s)] - \kappa[t,s,y(s)]\}\,ds \right| \le \mu \int_a^t |x(s) - y(s)|\,ds.$$

Hence, taking norms and writing $J(x) = \int_a^t x(s)\,ds$, we find

$$\|T(x) - T(y)\| \le \mu \max_t J(|x - y|) \le \mu \|J\|\,\|x - y\|,$$

and it is easy to see that $\|J\| = \delta_0$. Therefore,

$$\|T(x) - T(y)\| \le \mu\delta_0 \|x - y\|,$$

so that $\mu\delta_0 < 1$ is sufficient, by Theorem 14.1, for the convergence of the
iteration $x_{n+1} = T(x_n)$ to the (unique) solution $x = T(x)$, provided, of
course, that $x_0(t)$ is chosen suitably.
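The iteration $x_{n+1} = T(x_n)$ for a Volterra equation can be tried numerically. The following is a minimal sketch, not a method taken from the text: the integral is replaced by the trapezoidal rule on a uniform grid, and the function name `picard_volterra`, the grid size, and the test equation $x(t) = 1 + \int_0^t x(s)\,ds$ (solution $e^t$, Lipschitz constant $\mu = 1$, interval length $\delta_0 = 0.5$, so that $\mu\delta_0 < 1$) are all assumptions chosen for illustration.

```python
import math


def picard_volterra(z, kappa, a, t_end, n_grid=200, n_iter=20):
    """Successive approximations x_{k+1} = T(x_k) for
    x(t) = z(t) + int_a^t kappa(t, s, x(s)) ds on [a, t_end],
    with the integral approximated by the trapezoidal rule."""
    h = (t_end - a) / n_grid
    ts = [a + i * h for i in range(n_grid + 1)]
    x = [z(t) for t in ts]                      # initial approximation x_0 = z
    for _ in range(n_iter):
        new_x = []
        for i, t in enumerate(ts):
            vals = [kappa(t, ts[j], x[j]) for j in range(i + 1)]
            if i == 0:
                integral = 0.0
            else:
                integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
            new_x.append(z(t) + integral)
        x = new_x
    return ts, x


# Test problem (assumed): x(t) = 1 + int_0^t x(s) ds, whose solution is exp(t).
ts, xs = picard_volterra(lambda t: 1.0, lambda t, s, u: u, a=0.0, t_end=0.5)
print(xs[-1], math.exp(0.5))
```

The remaining discrepancy between the two printed values comes from the quadrature rule, not from the iteration, which has converged to the discrete fixed point well before the last sweep.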

This result is not so satisfactory as the analogous result for Fredholm 
equations. One reason for this is the fact that in the case of linear 
Volterra equations the iterations are known to converge no matter what 
initial approximation is chosen. Theorem 14.1 has been generalized 
[2, 5-8] so as to include this particular case as well as other similar ones. 
These generalizations consist mainly in the introduction of metric 
spaces in which the distance between two points is no longer a non- 
negative real number but rather an element of a certain type of partially 
ordered set. 


14.7 Boundary-value Problems 
Consider the boundary-value problem

$$F_1[x, y, \ldots, y^{(n)}] = F_2[x, y, \ldots, y^{(m)}] \quad (0 \le m \le n - 1),$$
$$U_j[y] = U_j[y(a_1), \ldots, y^{(n-1)}(a_1), y(a_2), \ldots, y^{(n-1)}(a_k)] = 0 \quad (j = 1, \ldots, n),$$

where the constants $a_1, \ldots, a_k$ are given, and suppose that the reduced
problem

$$F_1[x, y, \ldots, y^{(n)}] = r(x), \quad U_j[y] = 0 \quad (j = 1, \ldots, n)$$

has for any given continuous $r(x)$ a unique solution that can be determined
explicitly. Then the following iteration procedure can be set up:

$$F_1[x, y_{k+1}, \ldots, y_{k+1}^{(n)}] = F_2[x, y_k, \ldots, y_k^{(m)}] \quad (k = 0, 1, 2, \ldots),$$
$$U_j[y_{k+1}] = 0 \quad (j = 1, \ldots, n),$$


starting from a function $y_0(x)$ which is of class $\mathfrak{C}^n$ on the interval
considered and satisfies $U_j[y_0] = 0$. Schröder [7] determined conditions
for convergence under the assumptions that

$$F_1[x, y, \ldots, y^{(n)}] = \sum_{i} g_i(x)\, y^{(i)},$$

where $g_n(x) > 0$ and $g_i(x)$, $1 \le i \le n$, is continuous on the compact
interval $[a,b]$, and

$$U_j[y] = \sum_{i} [\alpha_{ji}\, y^{(i)}(a) + \beta_{ji}\, y^{(i)}(b)].$$


We do not describe his results here but, rather, restrict ourselves to 
discussing the special case 


$$y'' + f(x,y,y') = 0, \quad y(0) = c_1, \quad y(1) = c_2,$$


which was treated first by Picard [4] and later by Lettenmeyer [3],
Collatz [1], and also Schröder [7]; our treatment follows Collatz.

Suppose $f(x,y,p)$ is continuous on $x \in [0,1]$, $|y| \le \delta_1$, $|p| \le \delta_2$ and
satisfies there

$$|f(x,y,p) - f(x,y^*,p^*)| \le \mu_1 |y - y^*| + \mu_2 |p - p^*|$$

uniformly in $x$. Let

$$g(x,\xi) = \begin{cases} x(1 - \xi) & (0 \le x \le \xi \le 1) \\ \xi(1 - x) & (0 \le \xi \le x \le 1) \end{cases}$$

be Green’s function of the reduced problem $y'' = 0$, $y(0) = y(1) = 0$.
Then the solution of the given problem may be written in the form

$$y(x) = c_1 + (c_2 - c_1)x + \int_0^1 g(x,\xi)\, f[\xi, y(\xi), y'(\xi)]\,d\xi,$$

with

$$y'(x) = c_2 - c_1 + \int_0^1 g_x(x,\xi)\, f[\xi, y(\xi), y'(\xi)]\,d\xi,$$

where

$$g_x(x,\xi) = \begin{cases} 1 - \xi & (0 \le x < \xi \le 1) \\ -\xi & (0 \le \xi < x \le 1). \end{cases}$$

The iteration to be studied is

$$y_{n+1}(x) = c_1 + (c_2 - c_1)x + \int_0^1 g(x,\xi)\, f[\xi, y_n(\xi), y_n'(\xi)]\,d\xi,$$
$$y_{n+1}'(x) = c_2 - c_1 + \int_0^1 g_x(x,\xi)\, f[\xi, y_n(\xi), y_n'(\xi)]\,d\xi$$

with suitably chosen $y_0(x)$. The results of the general theory of successive
approximations for Fredholm integral equations are not applicable here
because $g_x(x,\xi)$ is not a continuous function of $x$.

We introduce the linear space $\mathfrak{C}^1[0,1]$ of real functions of class $\mathfrak{C}^1$
on $[0,1]$ and define on it the norm

$$\|y\| = \max_{x \in [0,1]} r^{-1}(x)\, [\mu_1 |y(x)| + \mu_2 |y'(x)|],$$

where $r(x)$ is some positive continuous function on $[0,1]$. It can be
shown that $\mathfrak{C}^1[0,1]$ is complete under this norm. The operator

$$T(y) = c_1 + (c_2 - c_1)x + \int_0^1 g(x,\xi)\, f[\xi, y(\xi), y'(\xi)]\,d\xi$$

from the set $D \subset \mathfrak{C}^1[0,1]$ of functions $y(x)$ $(\in \mathfrak{C}^1[0,1])$, with $|y(x)| \le \delta_1$,
$|y'(x)| \le \delta_2$ on $[0,1]$ into $\mathfrak{C}^1[0,1]$, satisfies the inequality

$$\|T(y) - T(y^*)\| \le \alpha \|y - y^*\|,$$

where

$$\alpha = \max_{x \in [0,1]} r^{-1}(x) \left[ \mu_1 \int_0^1 |g(x,\xi)|\, r(\xi)\,d\xi + \mu_2 \int_0^1 |g_x(x,\xi)|\, r(\xi)\,d\xi \right].$$

Hence, $\alpha < 1$ is required in order that the iterations converge. Clearly,
the estimation of $\alpha$ depends on the choice of the weight function $r(x)$.
For $r(x) = 1$ we find

$$\alpha_1 = \max_{x \in [0,1]} r^{-1}(x) \int_0^1 |g(x,\xi)|\, r(\xi)\,d\xi = \tfrac{1}{8},$$
$$\alpha_2 = \max_{x \in [0,1]} r^{-1}(x) \int_0^1 |g_x(x,\xi)|\, r(\xi)\,d\xi = \tfrac{1}{2},$$

so that $\alpha \le \tfrac{1}{2}(\tfrac{1}{4}\mu_1 + \mu_2)$. For larger values of $\mu_2$, Collatz [1] proposes
to use

$$r(x) = 1 - \tfrac{1}{4}(15 - \sqrt{33})\,x(1 - x) = 1 - 2.314\,x(1 - x),$$

which yields $\alpha \le B(\tfrac{1}{2}\mu_1 + \mu_2)$, where $B = \tfrac{1}{48}(9 + \sqrt{33}) = .3072$.
The range of permissible $\mu_1$, $\mu_2$ values for these two estimates is shown in
Fig. 14.1 (see also [7]).

Fig. 14.1

The estimations of $\alpha$ depend, of course, also on the constants $\mu_1$, $\mu_2$,
which depend themselves on the domain $D \subset \mathfrak{C}^1[0,1]$ on which the
operator $T$ is defined. This domain is not known a priori. It is often
very useful to start with a relatively rough estimate for $D$ and refine this
estimate in the course of the iteration process after more insight into the
problem has been gained. As an example, consider this problem (see [1]):

$$y'' - yy' + 6x = 0, \quad y(0) = y(1) = 0.$$
Here the iterations become

$$y_{k+1}(x) = \int_0^1 g(x,\xi)\, [6\xi - y_k(\xi)\, y_k'(\xi)]\,d\xi \quad (k = 0, 1, \ldots),$$

where $g(x,\xi)$ is Green’s function, as before. Selecting as domain
$D_0$: $[0,1]$, $|y| \le 1$, $|y'| \le 3$, we find $\mu_1 = 3$, $\mu_2 = 1$, so that with
$r(x) = 1$ the norm is simply

$$\|y\| = \max_{x \in [0,1]} [3 |y(x)| + |y'(x)|].$$

Hence $\alpha \le \tfrac{3}{8} + \tfrac{1}{2} = \tfrac{7}{8} < 1$. The “natural” initial approximation
would be $y_0(x) = 0$, which would yield $y_1(x) = x - x^3$; yet this initial
approximation does not satisfy the conditions of Theorem 14.1, since the
sphere

$$\|y - y_0\| \le \frac{1}{1 - \alpha} \|y_1 - y_0\|$$

is not contained in $D_0$. However, $y_1(x) = x - x^3$ is an admissible initial
approximation for which $y_2(x) = \tfrac{1}{210}(202x - 175x^3 - 42x^5 + 15x^7)$;
this is seen from the inequalities

$$|y_1(x) - y_2(x)| \le .0075, \quad |y_1'(x) - y_2'(x)| \le .0382,$$

which give $\|y_2 - y_1\| \le .0607$, so that $\|y - y_1\| \le 8 \|y_2 - y_1\| \le .4856$.
Therefore, the iterations $\{y_n(x)\}$ converge to the unique solution $y^*(x)$ of
the boundary-value problem in the sphere $\|y - y_1\| \le .4856$.

For $y_2(x)$ we obtain the error estimates

$$\|y_2 - y^*\| \le \frac{\alpha}{1 - \alpha} \|y_2 - y_1\| \le .4249,$$

which means that

$$|y_2(x) - y^*(x)| \le .142, \quad |y_2'(x) - y^{*\prime}(x)| \le .425$$

and hence

$$|y^*(x)| \le .535, \quad |y^{*\prime}(x)| \le 2.463.$$

Thus, our first estimate of the domain $D_0$ may now be improved to

$D_1$: $[0,1]$, $|y| \le .535$, $|y'| \le 2.463$.

This changes the norm in $\mathfrak{C}^1[0,1]$ to

$$\|y\| = \max_{x \in [0,1]} [2.463 |y(x)| + .535 |y'(x)|]$$

[$r(x) = 1$ again] and the convergence factor to

$$\alpha = \frac{2.463}{8} + \frac{.535}{2} = .576.$$

Note the considerable improvement in $\alpha$.
We now repeat the previous considerations to obtain a better estimate
for $\|y_2 - y_1\|$, which, in turn, permits us to refine our estimate of the
domain again. The third step yields $\mu_1 = 2.110$, $\mu_2 = .409$, the norm

$$\|y\| = \max_{x \in [0,1]} [2.110 |y(x)| + .409 |y'(x)|],$$

and the improved convergence factor $\alpha = .469$. Hence, we find,
as above,

$$|y_2(x) - y^*(x)| \le .01, \quad |y_2'(x) - y^{*\prime}(x)| \le .052.$$

These bounds are rather remarkable in that they refer to only the second
iterate.

Observe that we did not use Green’s function explicitly here because
the particular form of the equation $y'' - yy' + 6x = 0$ implies that all
iterates will be polynomials in $x$ provided the initial approximation
$y_0(x)$ is a polynomial. Thus, the iteration scheme can be modified to

$$y_{k+1}'' = y_k y_k' - 6x, \quad y_{k+1}(0) = y_{k+1}(1) = 0.$$

Collatz [1] showed that estimates for the convergence factor can be
obtained also in more general cases without explicit knowledge of
Green’s function.
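The polynomial form of the iteration lends itself to exact computation. The sketch below is an illustration only: polynomials are stored as coefficient lists (constant term first) with exact rational arithmetic, each step integrates $y_{k+1}'' = y_k y_k' - 6x$ twice and then fixes the boundary values, and all helper names (`poly_mul`, `next_iterate`, and so on) are hypothetical.

```python
from fractions import Fraction


def poly_mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_sub(p, q):
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return [a - b for a, b in zip(p, q)]


def poly_deriv(p):
    return [k * p[k] for k in range(1, len(p))] or [Fraction(0)]


def poly_integrate(p):
    # antiderivative with zero constant term
    return [Fraction(0)] + [c / (k + 1) for k, c in enumerate(p)]


def poly_eval(p, x):
    result = 0
    for c in reversed(p):
        result = result * x + c
    return result


def next_iterate(y):
    """One step of y_{k+1}'' = y_k y_k' - 6x with y_{k+1}(0) = y_{k+1}(1) = 0."""
    rhs = poly_mul(y, poly_deriv(y))
    rhs = rhs + [Fraction(0)] * max(0, 2 - len(rhs))
    rhs[1] -= 6                                   # subtract 6x
    w = poly_integrate(poly_integrate(rhs))       # w(0) = w'(0) = 0
    w[1] -= poly_eval(w, 1)                       # add a multiple of x so that y(1) = 0
    return w


y0 = [Fraction(0)]
y1 = next_iterate(y0)          # x - x^3
y2 = next_iterate(y1)          # (202x - 175x^3 - 42x^5 + 15x^7) / 210

diff = poly_sub(y1, y2)
grid = [i / 1000 for i in range(1001)]
max_diff = max(abs(poly_eval(diff, x)) for x in grid)
max_diff_deriv = max(abs(poly_eval(poly_deriv(diff), x)) for x in grid)
print(y2, max_diff, max_diff_deriv)
```

The two printed maxima are grid estimates of the quantities bounded by .0075 and .0382 in the inequalities quoted in the text.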
CONJUGATE-DIRECTION METHODS AND THE
METHOD OF STEEPEST DESCENT


This section deals with certain iteration methods for solving the
equation $Ax = v$ where $A$ is a linear, self-adjoint, positive definite
operator from a Hilbert space $\mathfrak{H}$ into itself. All considerations are based
on the well-known observation that most linear problems in analysis
may be reduced to the variational problem of finding extrema for
quadratic functionals.


14.8 Principal Theorems 


Let $A$ be a linear, self-adjoint, positive definite operator with domain
$D$ in a Hilbert space $\mathfrak{H}$ and range $R \subset \mathfrak{H}$. Then there exists an $m > 0$
such that $(Ax,x) \ge m(x,x)$ for all $x$ in $D$, and hence $A$ has a continuous
inverse $A^{-1}$ whose domain is $R$ and range $D$. Thus, for any given
$v \in R$ the equation

$$Ax = v$$

has a unique solution $u = A^{-1}v \in D$. If $x \in D$ is an approximation to $u$,
we call $r = v - Ax$ the residual of $x$ and call $y = u - x$ the error of $x$.
The functional

$$F(x) = (y,r) = (u - x, A(u - x))$$

is referred to as the error functional. For any $x \in D$ and any $h \in \mathfrak{H}$,
we have $\delta F(x,h) = -2(A(u - x),h) = -2(r,h)$, whence, by Schwarz’s
inequality,

$$\sup \{|\delta F(x,h)|,\ \|h\| = 1\} = 2\|r\|.$$

Therefore, we speak of $r$ also as the gradient of $F(x)$ at $x$.
Theorem 14.2. Let $\mathfrak{B}$ be a linear subspace of $D$ and $x_0 \in D$. Then

$$F(x_0) \le F(x_0 + z)$$

for all $z \in \mathfrak{B}$ if and only if the gradient $r_0$ of $F(x)$ at $x_0$ satisfies $(r_0,z) = 0$ on $\mathfrak{B}$.
Hence, in particular, $F(x_0) \le F(x)$ for all $x \in D$ if and only if $Ax_0 = v$.

Proof. The sufficiency part follows on observing that, for any nonzero
$z \in \mathfrak{B}$,

$$F(x_0 + z) - F(x_0) = (z,Az) - 2(r_0,z);$$

whence, if $(r_0,z) = 0$ on $\mathfrak{B}$, then $F(x_0 + z) - F(x_0) > 0$. If
$F(x_0) \le F(x_0 + z)$ for all $z \in \mathfrak{B}$, then

$$F(x_0) - F(x_0 + tz) = 2t(r_0,z) - t^2(z,Az) \le 0$$

for $z \in \mathfrak{B}$ and sufficiently small $|t| < \epsilon(x_0,z)$. Thus, the function
$F(x_0) - F(x_0 + tz)$ has a relative maximum at $t = 0$, and this implies

$$(r_0,z) = 0.$$

In particular, if $\mathfrak{B} = D$ and $z = u - x_0$, then

$$0 = (r_0, u - x_0) = (A(u - x_0), u - x_0) \ge m(u - x_0, u - x_0),$$

which is possible only if $x_0 = u$, that is, if $Ax_0 = v$.

This theorem admits a simple geometric interpretation when $D = \mathfrak{H}$
and $\mathfrak{H}$ is the euclidean plane. Then $F(x) = \text{const}$ defines a family
of ellipses with center $u$ $(= A^{-1}v)$, and the gradient of $F(x)$ at $x$ has the
direction of the normal to the ellipse through $x$. The linear subspace $\mathfrak{B}$
is a straight line through the origin. It is clear that on $x_0 + \mathfrak{B}$ there is
always an interval $[x_0,x_1]$ on which $F(x) \le F(x_0)$ unless $x_0 + \mathfrak{B}$ is
tangent to the ellipse through $x_0$, that is, orthogonal to $r_0$.

Fig. 14.2

Theorem 14.3. Let $\mathfrak{B}$ be a closed linear subspace of $D$. There exists a
unique $x_0$ minimizing $F(x)$ on $\mathfrak{B}$ for which $r_0 = A(u - x_0)$ satisfies
$(r_0,z) = 0$ on $\mathfrak{B}$.

Proof. Since $A$ is positive definite, $\inf \{F(x), x \in \mathfrak{B}\} = \gamma \ge 0$. Let
$\{x_n\}$ be a sequence of points $x_n \in \mathfrak{B}$ such that $\gamma_n = F(x_n) \to \gamma$ as
$n \to \infty$. Then $F(x_n + tz) \ge \gamma$ for any real $t$ and $z \in \mathfrak{B}$, which implies

$$(z, A(u - x_n))^2 \le (\gamma_n - \gamma)(z,Az).$$

For $z = x_n - x_m$, $m \ne n$, this yields

$$(\gamma_m - \gamma)^{1/2} + (\gamma_n - \gamma)^{1/2} \ge (x_n - x_m, A(x_n - x_m))^{1/2},$$

whence, by the positive definiteness of $A$, $\{x_n\}$ is a Cauchy sequence.
Since $\mathfrak{H}$ is complete and $\mathfrak{B}$ closed, there is an $x_0 \in \mathfrak{B}$ such that $x_n \to x_0$
as $n \to \infty$. Now, for any $z \in \mathfrak{B}$,

$$|(r_0,z)| = |(A(u - x_0), z)| \le |(A(u - x_n), z)| + |(A(x_n - x_0), z)| \le |(A(u - x_n), z)| + \|Az\|\,\|x_n - x_0\|,$$

and since the right-hand side tends to zero as $n \to \infty$, this shows that
$(r_0,z) = 0$. Therefore, by Theorem 14.2, $F(x_0) \le F(z)$ for all $z \in \mathfrak{B}$,
that is, $F(x_0) = \gamma$.

Suppose $\mathfrak{B}$ contains an $x' \ne x_0$ for which also $F(x') \le F(z)$ for
any $z \in \mathfrak{B}$. Then, by Theorem 14.2, $(A(u - x'), z) = 0$, whence
$(A(x_0 - x'), z) = 0$ for all $z \in \mathfrak{B}$. If $z = x_0 - x'$, this contradicts the
positive definiteness of $A$. Hence $x_0$ is unique.

Theorem 14.4. Let $\{\mathfrak{B}_n\}$ be an expanding sequence of closed linear
subspaces $\mathfrak{B}_n \subset D$ covering $D$, and let $\{x_n\}$ be the sequence of points such that
$x_n \in \mathfrak{B}_n$ and $F(x_n) = \min \{F(x), x \in \mathfrak{B}_n\}$. Then $r_n = A(u - x_n)$ satisfies
$(r_n,z) = 0$ on $\mathfrak{B}_n$, for each $n$, and $x_n \to u$ as $n \to \infty$ (in the norm topology).

Proof. The first assertion follows trivially from Theorem 14.3. The
proof of the second assertion parallels the proof of Theorem 14.3.
Since $F(x_1) \ge F(x_2) \ge \cdots \ge 0$, there is a $\gamma \ge 0$ such that $F(x_n) \to \gamma$
as $n \to \infty$. Hence, as before, $(z, A(u - x_n))^2 \le [F(x_n) - \gamma](z,Az)$
for $z \in \bigcup_1^\infty \mathfrak{B}_n$, so that for $z = x_n - x_m$, $m \ne n$,

$$[F(x_n) - \gamma]^{1/2} + [F(x_m) - \gamma]^{1/2} \ge (x_n - x_m, A(x_n - x_m))^{1/2},$$

which shows, by the positive definiteness of $A$, that $\{x_n\}$ is a Cauchy
sequence. Thus, $x_n \to x_0 \in \mathfrak{H}$ as $n \to \infty$. Given $z \in \bigcup_1^\infty \mathfrak{B}_n$, there is a
smallest integer $n = n_0(z)$ such that $z \in \mathfrak{B}_{n_0}$. Then $(z, A(u - x_n)) = 0$
for $n \ge n_0$, or $(Az, u - x_n) = 0$ for $n \ge n_0$, and this implies, by
continuity, that $(Az, u - x_0) = 0$ for every $z \in \bigcup_1^\infty \mathfrak{B}_n$. From the
self-adjointness of $A$ follows $u - x_0 \in D$ (since $\bigcup_1^\infty \mathfrak{B}_n = D$).

We have yet to show that $x_0 = u$. This follows because, as we have
just shown, $u - x_0 \in D$, and so

$$(u - x_0, A(u - x_0)) = 0,$$

which means that

$$A(u - x_0) \in D^{\perp}.$$

The positiveness of $A$ implies $u = x_0$ as required.

The preceding theorems may be extended to more general classes of
operators (see, e.g., [4, 5]). Our presentation follows that of [4],
except that our proofs are based on ideas due to Friedrichs [1, 2].


14.9 Conjugate-direction Methods 


The method of expanding subspaces, though not immediately usable
for practical applications as outlined above, forms the basis of many
numerical methods. For the discussion of some of these methods, we
assume now that $A$ is defined on the entire space $\mathfrak{H}$ and hence necessarily
bounded; that is, $0 < m(x,x) \le (Ax,x) \le M(x,x)$ for all $x \ne 0$ in $\mathfrak{H}$.
This restriction may be removed to a certain extent (see [4, 7]). We also
make the assumption that $\mathfrak{H}$ is separable; it guarantees the existence of
at least one linearly independent sequence $\{p_n\}$, $p_n \in \mathfrak{H}$, such that the
finite-dimensional and hence closed linear subspaces $\mathfrak{B}_n$ spanned by
$\{p_0, \ldots, p_{n-1}\}$ form an expanding sequence with $\bigcup_1^\infty \mathfrak{B}_n = \mathfrak{H}$.

To apply Theorem 14.4, we must compute $x_n \in \mathfrak{B}_n$ for each $n \ge 1$,
such that $F(x_n) = \min \{F(x), x \in \mathfrak{B}_n\}$. This is trivial for $x_1$, because,
by Theorem 14.3, $(A(u - x_1), p_0) = (v - Ax_1, p_0) = 0$; hence, since
$x_1 = \lambda p_0$, we have $x_1 = [(v,p_0)/(Ap_0,p_0)]p_0$. In general, we shall have
$x_n = \sum_{k=0}^{n-1} \lambda_{nk} p_k$, where the coefficients depend on $n$. This raises the
question whether conditions upon the sequence $\{p_n\}$ can be found under
which, for each $n$, $x_{n+1} = x_n + \alpha_n p_n$ so that $x_n = \sum_{k=0}^{n-1} \alpha_k p_k$, with coefficients
$\alpha_k$ independent of $n$. For, in this case, if $x_n$ has been found, Theorem
14.3 yields, for $x_{n+1}$, the equation $(A(u - x_{n+1}), p_n) = 0$, from which we
obtain immediately $x_{n+1} = x_n + [(r_n,p_n)/(Ap_n,p_n)]p_n$ with $r_n = v - Ax_n$.
Hence, the sequence $\{x_n\}$ can be determined simply by the iteration

$$x_0 = 0, \quad x_{n+1} = x_n + \frac{(r_n,p_n)}{(Ap_n,p_n)}\, p_n \quad (n \ge 0).$$


An answer to the question raised is given by the following two lemmas.

Lemma 14.1. Let $\{x_n\}$, $x_n \in \mathfrak{B}_n$, be such that $F(x_n) = \min \{F(x), x \in \mathfrak{B}_n\}$.
If for each $n$, $x_n = \sum_{k=0}^{n-1} \alpha_k p_k$, where the coefficients are independent of $n$, then for
each $k \ge 0$, either

$$(Ap_k,p_j) = 0 \text{ for } j = 0, \ldots, k - 1 \quad \text{and} \quad \alpha_k = \frac{(r_k,p_k)}{(Ap_k,p_k)}, \text{ where } r_k = v - Ax_k,$$

or $(r_k,p_k) = 0$ and $\alpha_k = 0$.

Proof. Since $x_{k+1} = x_k + \alpha_k p_k$, $r_{k+1} - r_k = -\alpha_k Ap_k$, and hence, for
$j = 0, \ldots, k$, $(r_{k+1},p_j) - (r_k,p_j) = -\alpha_k(Ap_k,p_j)$. Thus, by Theorem
14.3, $\alpha_k = (r_k,p_k)/(Ap_k,p_k)$ and $\alpha_k(Ap_k,p_j) = 0$ for $j = 0, \ldots, k - 1$.
Therefore, either $(Ap_k,p_j) = 0$ for $j = 0, \ldots, k - 1$ or $\alpha_k = 0$; that is,
$(r_k,p_k) = 0$.

Lemma 14.2. Let $\{p_n\}$, $p_n \in \mathfrak{H}$, be a linearly independent sequence,
$\{\mathfrak{B}_n\}$ the sequence of subspaces $\mathfrak{B}_n \subset \mathfrak{H}$ spanned by $\{p_0, \ldots, p_{n-1}\}$ such that
$\bigcup_1^\infty \mathfrak{B}_n = \mathfrak{H}$. Let

$$x_0 = 0, \quad x_{n+1} = x_n + \frac{(r_n,p_n)}{(Ap_n,p_n)}\, p_n, \quad n \ge 0,$$

where $r_n = v - Ax_n$. If either $(r_n,p_n) = 0$ or $(Ap_n,p_k) = 0$ for $k = 0, \ldots,
n - 1$, then $x_n$ is such that $F(x_n) = \min \{F(x), x \in \mathfrak{B}_n\}$ and $x_n \to u$ as
$n \to \infty$ (in the norm topology).

Proof. Note first that $(r_n,p_k) = 0$ for $k = 0, 1, \ldots, n - 1$. This
follows readily by induction. Hence, for any $x = \sum_{k=0}^{n-1} \xi_k p_k$, we find that
$(r_n, x - x_n) = 0$, whence, by Theorem 14.2, $F(x_n) \le F(x)$. Thus
$F(x_n) = \min \{F(x), x \in \mathfrak{B}_n\}$. Theorem 14.4 implies $x_n \to u$ as $n \to \infty$.

Several remarks concerning Lemma 14.2 are in order. In the first
place, observe that the condition $(r_n,p_n) = 0$ implies $x_{n+1} = x_n$, so that
the minimum of $F(x)$ in $\mathfrak{B}_n$ is also the minimum in $\mathfrak{B}_{n+1} \supset \mathfrak{B}_n$. This
happens, for example, when $x_n = u$. If $\mathfrak{H}$ is finite-dimensional, then the
iteration always converges in finitely many steps. Except for this
particular case, however, the occurrence of any repetitive steps $x_{n+1} = x_n$
is undesirable. Hence those vectors $\{p_n\}$ will be of importance for which
$(Ap_n,p_k) = 0$ for $k = 0, \ldots, n - 1$. Then, for every $n$ and $k = 0, \ldots,
n - 1$,

$$(r_n,p_n) = (r_k,p_n) + (r_n - r_k, p_n) = (r_k,p_n) - \sum_{j=k}^{n-1} \alpha_j (Ap_j,p_n) = (r_k,p_n),$$

so that

$$\alpha_n = \frac{(r_k,p_n)}{(Ap_n,p_n)} \quad \text{for } k = 0, \ldots, n - 1.$$

Secondly, the requirement in Lemma 14.2 that the iteration start
with $x_0 = 0$ can be easily removed. For, if $x_0 \ne 0$, we need consider
only the problem $Ax = v - Ax_0 = \hat{v}$; the iteration $\hat{x}_0 = 0$, $\hat{x}_{n+1} = \hat{x}_n +
[(\hat{r}_n,p_n)/(Ap_n,p_n)]p_n$ converges then to the solution $\hat{u} = u - x_0$.

We combine these remarks with Lemma 14.2 in the following 
theorem. 

Theorem 14.5. Let $\{p_n\}$, $p_n \in \mathfrak{H}$, be a linearly independent sequence so
chosen that $(p_i,Ap_k) = 0$ for $i \ne k$ and that the subspaces $\mathfrak{B}_n \subset \mathfrak{H}$ spanned by
$\{p_0, \ldots, p_{n-1}\}$ satisfy $\bigcup_1^\infty \mathfrak{B}_n = \mathfrak{H}$. Then, for any $x_0 \in \mathfrak{H}$, the sequence $\{x_n\}$
defined by $x_{n+1} = x_n + \alpha_n p_n$, $n \ge 0$, where $\alpha_n = (r_n,p_n)/(Ap_n,p_n)$, converges
to the unique solution $u$ of $Ax = v$. Moreover, $x_n$ minimizes $F(x)$ on the line
$x = x_{n-1} + \alpha p_{n-1}$ as well as on the plane $x = x_0 + \mathfrak{B}_n = \{x \in \mathfrak{H},\ x = x_0 +
\sum_{k=0}^{n-1} \xi_k p_k\}$. The vectors $r_n = v - Ax_n$ satisfy $r_{n+1} = r_n - \alpha_n Ap_n$, and

$$(r_n,p_k) = 0, \quad (r_n,p_n) = (r_k,p_n) \quad (k = 0, 1, \ldots, n - 1).$$

Hence also

$$\alpha_n = \frac{(r_k,p_n)}{(Ap_n,p_n)} \quad \text{for } k = 0, \ldots, n - 1.$$

The iteration described in Theorem 14.5 has been called by Hestenes
and Stiefel [5, 6, 9, 10] an iteration by the method of conjugate directions.
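In finite dimensions the iteration of Theorem 14.5 can be written out directly. The following sketch is an illustration under stated assumptions: $\mathfrak{H}$ is taken to be real 3-space with the usual inner product, the mutually $A$-orthogonal directions are obtained from the unit vectors by a Gram-Schmidt process with respect to the inner product $(x, Ay)$, and the function names and the sample matrix are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def a_orthogonalize(A, vectors):
    """Turn linearly independent vectors into mutually A-orthogonal ones."""
    ps = []
    for u in vectors:
        p = list(u)
        for q in ps:
            Aq = matvec(A, q)
            p = axpy(-dot(u, Aq) / dot(q, Aq), q, p)
        ps.append(p)
    return ps


def conjugate_directions(A, v, x0, ps):
    """x_{n+1} = x_n + alpha_n p_n with alpha_n = (r_n, p_n) / (A p_n, p_n)."""
    x = list(x0)
    for p in ps:
        r = axpy(-1.0, matvec(A, x), v)          # residual r_n = v - A x_n
        Ap = matvec(A, p)
        x = axpy(dot(r, p) / dot(Ap, p), p, x)
    return x


# Assumed sample data: a symmetric positive definite 3-by-3 matrix.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
v = [1.0, 2.0, 3.0]
ps = a_orthogonalize(A, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
x = conjugate_directions(A, v, [0.0, 0.0, 0.0], ps)
residual = axpy(-1.0, matvec(A, x), v)
print(x, residual)
```

Since the space is three-dimensional, three steps exhaust it and the residual vanishes up to rounding, in line with the remark on finite-dimensional spaces.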


14.10 Error Estimates and Rates of Convergence 


For any given iteration by the method of conjugate directions, the 
error function F(x) decreases at each step. However, it cannot be 
computed without knowledge of the desired solution. Estimates for 
its magnitude are given by the following theorems. 

Theorem 14.6. Let $y = u - x$ be the error for any $x \in \mathfrak{H}$ and $r = v - Ax$.
If $\mu(z) = (Az,z)/(z,z)$ for $z \in \mathfrak{H}$, then

(i) $\mu(y) \le \mu(r)$,
(ii) $(r,r)/\mu(r) \le F(x) \le (r,r)/\mu(y)$,
(iii) $\|r\|/\mu(r) \le \|y\| \le \|r\|/\mu(y)$.

Proof. There exists a unique positive self-adjoint operator $B$ such
that $B^2 = A$. Schwarz’s inequality yields $(Bz,BAz)^2 \le (z,Az)(Az,A^2z)$
and $(z,Az)^2 \le (z,z)(Az,Az)$, whence

$$\frac{(z,Az)}{(z,z)} \le \frac{(Az,Az)}{(z,Az)} \le \frac{(Az,A^2z)}{(Az,Az)}.$$

For $z = y$, this gives (i) and (ii). From $F(x) = (y,Ay) = (y,r) \le \|y\|\,\|r\|$
follows (iii).

Theorem 14.7. Let $\{x_n\}$ be defined as in Theorem 14.5, let $y_n = u - x_n$
and $r_n = v - Ax_n$. Then

(i) $F(x_n) \le \left(1 - \frac{m}{M}\sigma_{n-1}\right) F(x_{n-1})$,
(ii) $(y_n,y_n) \le \frac{1}{m}\left(1 - \frac{m}{M}\sigma_{n-1}\right) F(x_{n-1})$,
(iii) $(r_n,r_n) \le M\left(1 - \frac{m}{M}\sigma_{n-1}\right) F(x_{n-1})$,

where $\sigma_n = (r_n,p_n)^2/[(r_n,r_n)(p_n,p_n)] \le 1$ and $m$, $M$ are the lower and upper
bounds of $A$, respectively.

Proof. Let $\mu(z) = (Az,z)/(z,z)$, $\nu(z) = (A^{-1}z,z)/(z,z)$ for $z \in \mathfrak{H}$,
so that $0 < m \le \mu(z) \le M$, $0 < 1/M \le \nu(z) \le 1/m$. An easy computation
yields

$$\frac{F(x_n) - F(x_{n+1})}{F(x_n)} = \frac{\sigma_n}{\mu(p_n)\,\nu(r_n)},$$

whence (i) follows directly. (ii) and (iii) follow from (i) on using

$$F(x_{n+1}) \ge m\,(y_{n+1},y_{n+1}), \quad F(x_{n+1}) \ge \frac{1}{M}\,(r_{n+1},r_{n+1}).$$

Theorem 14.7 shows that the best possible choice for the sequence $\{p_n\}$
would be that for which $\sigma_n = 1$ whatever $n$; this, however, occurs only
when $p_n$ and $r_n$ are linearly dependent for each $n$. Iteration methods
with this choice of the sequence $\{p_n\}$ are the so-called steepest-descent
methods (see Sec. 14.13).


14.11 The Conjugate-gradient Method 


All conjugate-direction methods assume that there is given a linearly
independent sequence $\{p_n\}$, $p_n \in \mathfrak{H}$, such that the vectors $p_n$ are mutually
$A$-orthogonal, that is, $(Ap_i,p_k) = 0$ for $i \ne k$, and that $\bigcup_1^\infty \mathfrak{B}_n = \mathfrak{H}$ where
$\mathfrak{B}_n \subset \mathfrak{H}$ is the subspace spanned by $\{p_0, \ldots, p_{n-1}\}$. From any linearly
independent sequence $\{u_n\}$, $u_n \in \mathfrak{H}$, one can construct, by a straightforward
adaptation of the Gram-Schmidt orthonormalization process,
a linearly independent sequence $\{p_n\}$, $p_n \in \mathfrak{H}$, of mutually $A$-orthogonal
vectors:

$$p_0 = \eta_0 u_0, \quad p_n = \eta_n \left[ u_n - \sum_{k=0}^{n-1} \frac{(u_n,Ap_k)}{(p_k,Ap_k)}\, p_k \right] \quad (n \ge 1),$$

where $\eta_n \ne 0$ are normalization factors; and it is clear that the subspace
$\tilde{\mathfrak{B}}_n$ spanned by $\{u_0, \ldots, u_{n-1}\}$ is $\mathfrak{B}_n$ and that $\bigcup_1^\infty \mathfrak{B}_n = \mathfrak{H}$ if and only if
$\bigcup_1^\infty \tilde{\mathfrak{B}}_n = \mathfrak{H}$. Thus, if an initial approximation $x_0 \in \mathfrak{H}$ to the solution
$u$ of $Ax = v$ is known, a natural choice for the sequence $\{u_n\}$ will be

$$u_0 = r_0 = v - Ax_0, \quad u_n = A^n r_0 \quad (n \ge 1),$$

provided it can be shown to be linearly independent and such that
$\bigcup_1^\infty \tilde{\mathfrak{B}}_n = \mathfrak{H}$. No necessary and sufficient conditions on $A$ appear to be
known which ensure the existence of vectors $r_0$ for which the sequence
$\{u_n\}$ has the required properties. Therefore, $x_0$ must be assumed to be
such that these requirements on $\{u_n\}$ are met. The $A$-orthogonalization
process then yields

$$p_0 = \eta_0 r_0, \quad p_n = \eta_n \left[ A^n r_0 - \sum_{k=0}^{n-1} \frac{(A^n r_0,Ap_k)}{(p_k,Ap_k)}\, p_k \right] \quad (n \ge 1).$$

It is natural to expect that a conjugate-direction method using this
sequence $\{p_n\}$ and starting from $x_0$ as the initial approximation will lead
to good results. This is indeed so [10, 11]; yet in many applications
this advantage would hardly outweigh the amount of work involved in
such a procedure.
Hestenes and Stiefel [6] observed that by a judicious selection of the
normalization factors $\eta_n$ in $\{p_n\}$ the iterations involved in the
$A$-orthogonalization process assume the same simple form as the iteration in a
conjugate-direction method. This is their conjugate-gradient method.

Note first that $p_n$ appears as a formal polynomial in $A$ so that the
equation $(Ap_i,p_k) = 0$ for $i \ne k$ of $A$-orthogonality becomes an
orthogonality relation between such polynomials. Just as for scalar
orthogonal polynomials, it can be shown that any three consecutive
polynomials $p_{n+1}$, $p_n$, $p_{n-1}$ are connected by a linear relation. In fact,
we find

$$p_0 = \eta_0 r_0, \quad p_{n+1} = \gamma_n (Ap_n - \delta_n p_n - \epsilon_n p_{n-1}) \quad (n \ge 1),$$

where

$$\gamma_n = \frac{\eta_{n+1}}{\eta_n}, \quad \delta_n = \frac{(Ap_n,Ap_n)}{(p_n,Ap_n)}, \quad \epsilon_n = \frac{(Ap_n,Ap_{n-1})}{(p_{n-1},Ap_{n-1})}.$$

Now the conjugate-direction method using $\{p_n\}$ and starting from $x_0$
yields

$$x_{n+1} = x_n + \alpha_n p_n, \quad \alpha_n = \frac{(r_n,p_n)}{(Ap_n,p_n)} \quad (n \ge 0),$$
$$r_{n+1} = r_n - \alpha_n Ap_n, \quad (r_n,p_k) = 0 \quad (k = 0, \ldots, n - 1).$$

Choosing $\eta_0 = 1$ and $\gamma_n = -\alpha_n$, we obtain, after a simple calculation,

$$p_{n+1} = -\alpha_n Ap_n + (1 - \beta_n) p_n + \beta_{n-1} p_{n-1} \quad (n \ge 0),$$

where

$$\beta_n = \frac{(r_{n+1},Ap_n)}{(p_n,Ap_n)} \quad (n \ge 0),$$

and since $p_1 = r_1 - \beta_0 p_0$, we have, by induction,

$$p_n = r_n - \beta_{n-1} p_{n-1} \quad (n \ge 1).$$
Theorem 14.8. Let $x_0 \in \mathfrak{H}$ be given and suppose that $\{A^n r_0\}$, $r_0 = v -
Ax_0$, is a linearly independent sequence such that $\bigcup_1^\infty \mathfrak{B}_n = \mathfrak{H}$, where $\mathfrak{B}_n$ is the
subspace spanned by $\{r_0, Ar_0, \ldots, A^{n-1}r_0\}$. Define $p_0 = r_0$ and

$$x_{n+1} = x_n + \alpha_n p_n, \quad r_{n+1} = r_n - \alpha_n Ap_n, \quad p_{n+1} = r_{n+1} - \beta_n p_n \quad (n \ge 0),$$

where

$$\alpha_n = \frac{(p_n,r_n)}{(p_n,Ap_n)}, \quad \beta_n = \frac{(r_{n+1},Ap_n)}{(p_n,Ap_n)}.$$

Then $x_n \to u$ as $n \to \infty$ (in the norm topology) where $u$ is the (unique) solution
of $Ax = v$. Moreover, $x_n$ minimizes $F(x)$ on the line $x = x_{n-1} + \alpha p_{n-1}$
as well as on the plane $x = x_0 + \sum_{k=0}^{n-1} \xi_k p_k$.

For a detailed discussion of the properties of the conjugate-gradient
method, which would lead us too far here, we refer to [5, 6]. We
mention only the following relation, which we state without proof.

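Lemma 14.3. Consider the conjugate-gradient method of Theorem 14.8.
Then

(i) $(r_i,r_k) = 0$ for $i \ne k$,
(ii) $(p_n,r_k) = (r_n,r_n)$ for $k = 0, 1, \ldots, n$,
(iii) $p_n = (r_n,r_n) \sum_{k=0}^{n} \dfrac{r_k}{(r_k,r_k)}$,
(iv) $(p_n,p_k) = \dfrac{(r_n,r_n)}{(r_k,r_k)}\,(p_k,p_k)$ for $k = 0, 1, \ldots, n$.

Moreover, $\alpha_n = \dfrac{(r_n,r_n)}{(p_n,Ap_n)}$ and $\beta_n = -\dfrac{(r_{n+1},r_{n+1})}{(r_n,r_n)}$ for $n \ge 0$.

It follows that $\alpha_n \ne 0$ and hence that $x_{n+1} \ne x_n$ if and only if $r_n \ne 0$.
Thus, if $r_{n_0} = 0$, then $x_{n_0} = u$, and $r_n = p_n = 0$ and $x_n = u$ for $n \ge n_0$,
so that the process stops. This shows that, if $\mathfrak{H}$ is $N$-dimensional, the
conjugate-gradient method terminates with $x_N = u$.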
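The recursion of Theorem 14.8 and some of the relations of Lemma 14.3 can be checked numerically in a small case. The sketch below is only an illustration: it assumes real 3-space with the usual inner product, uses the sign convention $p_{n+1} = r_{n+1} - \beta_n p_n$ with $\beta_n = (r_{n+1},Ap_n)/(p_n,Ap_n)$, and the function name `conjugate_gradient` and the sample data are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(A, v, x0, steps):
    """Run the given number of conjugate-gradient steps and keep the history
    of residuals r_n, directions p_n and coefficients beta_n."""
    x = list(x0)
    r = axpy(-1.0, matvec(A, x), v)               # r_0 = v - A x_0
    p = list(r)                                   # p_0 = r_0
    rs, ps, betas = [r], [p], []
    for _ in range(steps):
        Ap = matvec(A, p)
        alpha = dot(p, r) / dot(p, Ap)
        x = axpy(alpha, p, x)                     # x_{n+1} = x_n + alpha_n p_n
        r_new = axpy(-alpha, Ap, r)               # r_{n+1} = r_n - alpha_n A p_n
        beta = dot(r_new, Ap) / dot(p, Ap)
        p = axpy(-beta, p, r_new)                 # p_{n+1} = r_{n+1} - beta_n p_n
        r = r_new
        rs.append(r)
        ps.append(p)
        betas.append(beta)
    return x, rs, ps, betas


# Assumed sample data: a symmetric positive definite 3-by-3 matrix.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
v = [1.0, 2.0, 3.0]
x, rs, ps, betas = conjugate_gradient(A, v, [0.0, 0.0, 0.0], steps=3)
print(x, rs[-1])
```

With three steps in a three-dimensional space the last residual is zero up to rounding, which illustrates the termination property stated after the lemma.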

As mentioned before (see p. 506), the error function $F(x)$ is decreased
at each step in any conjugate-direction method. This, however, does
not imply in general that also the error vectors $y_n = u - x_n$ or the
residual vectors $r_n = v - Ax_n$ are decreased at each step (see [6]). The
next theorem shows that for the conjugate-gradient method the sequence
$\{\|y_n\|\}$ always decreases monotonically.

Theorem 14.9. Let $\mathfrak{C}_n$ be the convex closure of $x_0, \ldots, x_n$; that is,
$\mathfrak{C}_n = \{x \in \mathfrak{H},\ x = \sum_{k=0}^{n} \xi_k x_k,\ \xi_k \ge 0,\ \sum_{k=0}^{n} \xi_k = 1\}$. Then $x_n \in \mathfrak{C}_n$ minimizes
$\|u - x\|$ on $\mathfrak{C}_n$.

Proof. For any $x \in \mathfrak{C}_n$, we have $x_n - x = \sum_{k=0}^{n-1} \zeta_k p_k$ where $\zeta_k \ge 0$,
and since $(p_i,p_k) \ge 0$ for $i = 0, \ldots, k - 1$, it follows readily that
$(u - x_n, x_n - x) \ge 0$. Hence $\|u - x\|^2 - \|u - x_n\|^2 \ge 0$ for $x \in \mathfrak{C}_n$.


14.12 Rate of Convergence 


For the conjugate-gradient method the estimates of Theorem 14.7
can be improved as follows.

Theorem 14.10. Let $\{x_n\}$, $\{r_n\}$ be defined as in Theorem 14.8 and let
$y_n = u - x_n$, $n \ge 0$. Then

(i) $F(x_n) \le \delta F(x_{n-1}) \le \delta^n F(x_0)$,
(ii) $(y_n,y_n) \le \frac{1}{m}\,\delta F(x_{n-1}) \le \frac{1}{m}\,\delta^n F(x_0)$,
(iii) $(r_n,r_n) \le M \delta F(x_{n-1}) \le M \delta^n F(x_0)$,

where $\delta = 1 - m/M$ and $m$, $M$ are the lower and upper bounds of $A$, respectively.

Proof. In the notation of the proof of Theorem 14.7 we find, after a
slight reduction,

$$\frac{F(x_{n+1})}{F(x_n)} = 1 - \frac{\alpha_n}{\nu(r_n)}, \qquad \frac{1}{M} \le \frac{1}{\mu(r_n)} \le \alpha_n \le \frac{1}{\mu(p_n)} \le \frac{1}{m}.$$

Hence $F(x_{n+1})/F(x_n) \le 1 - m/M$. Since

$$(p_n,p_n) = (r_n,r_n) + \beta_{n-1}^2 (p_{n-1},p_{n-1}) \ge (r_n,r_n),$$

equality is possible only if $\beta_{n-1} = 0$ or $p_{n-1} = 0$. The first alternative
arises if $r_n = 0$, that is, if the iteration terminates; the second is
impossible so long as the iteration progresses. Thus $\sigma_n < 1$, which
improves the estimate of Theorem 14.7.


14.13 The Method of Steepest Descent 


The rate of convergence of a conjugate-direction method is best
(see p. 507) if the sequence $\{p_n\}$ is so chosen that $p_n = r_n$ for every $n$.
This is done in the method of steepest descent. However, Theorem
14.5 no longer applies, since, in general, the relations $(Ar_i,r_k) = 0$ for
$i \ne k$ do not hold.

Let us first make the following observation. If $x_0 \in \mathfrak{H}$ is an
approximation to the solution $u$ of $Ax = v$ and if

$$x_{n+1} = x_n + \alpha_n r_n \quad \text{and} \quad r_n = v - Ax_n \quad (n \ge 0),$$

where $\{\alpha_n\}$ is some sequence of constants, then $x_{n+1}$ lies at each step in the
direction which maximizes the change of the error function $F(x)$. This
follows from the fact that the differential $\delta F(x_n,h)$ assumes its maximum
over all unit vectors $h \in \mathfrak{H}$ for $h = \|r_n\|^{-1} r_n$. Hence, in order to minimize
$F(x)$ in going from $x_n$ to $x_{n+1}$, we must choose $\alpha_n$ such that

$$(v - A(x_n + \alpha_n r_n), r_n) = 0,$$

that is, $\alpha_n = (r_n,r_n)/(Ar_n,r_n)$. This is the sequence $\{\alpha_n\}$ used in the
method of steepest descent (which also explains its name). Observe
that $F(x)$ is not minimized on a sequence of expanding subspaces as it is
in any conjugate-direction method.
Theorem 14.11. For any $x_0 \in \mathfrak{H}$ the sequence $\{x_n\}$ defined by

$$x_{n+1} = x_n + \frac{(r_n,r_n)}{(Ar_n,r_n)}\, r_n, \quad r_n = v - Ax_n \quad (n \ge 0)$$

converges (in the norm topology) to the (unique) solution $u$ of $Ax = v$. Moreover,
$x_n$ minimizes $F(x)$ on the line $x = x_{n-1} + \alpha r_{n-1}$. The rate of convergence is
given by

$$(y_n,y_n) \le \frac{1}{m} \left( \frac{M - m}{M + m} \right)^{2n} F(x_0),$$

where $m$, $M$ are the lower and upper bounds of $A$, respectively.

Proof. We readily find

$$F(x_{n+1}) \le \left(1 - \frac{m}{M}\right) F(x_n) \le \left(1 - \frac{m}{M}\right)^{n+1} F(x_0) \quad (n \ge 0),$$

whence $(y_n,y_n) \le \frac{1}{m}\left(1 - \frac{m}{M}\right)^n F(x_0)$.
This implies $x_n \to u$ as $n \to \infty$. To obtain the estimate for the rate of
convergence, we use with $x = r_n$ the inequality $(x,x)^2 \le (Ax,x)(A^{-1}x,x)
\le [(m + M)^2/4mM](x,x)^2$ proved in [3] for any $x \in \mathfrak{H}$.

Theorem 14.11 has been extended by Kantorovich [7] to more general
operators $A$, particularly to self-adjoint semibounded operators.
Another extension, due to Rosenbloom [12], uses in place of the
polygonal paths a continuous path of steepest descent defined by a differential
equation.
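The per-step reduction of the error functional in the method of steepest descent can be observed on a small example. The sketch below is an illustration under stated assumptions: the space is the real plane, $A$ is a diagonal matrix with bounds $m = 1$ and $M = 10$, the exact solution $u$ is supplied so that $F(x) = (u - x, A(u - x))$ can be evaluated, and the per-step factor $((M - m)/(M + m))^2$ used for comparison is the one that follows from the inequality quoted in the proof. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def error_functional(A, u, x):
    y = axpy(-1.0, x, u)                          # error y = u - x
    return dot(y, matvec(A, y))


def steepest_descent(A, v, x0, steps):
    """x_{n+1} = x_n + [(r_n, r_n) / (A r_n, r_n)] r_n with r_n = v - A x_n."""
    xs = [list(x0)]
    for _ in range(steps):
        x = xs[-1]
        r = axpy(-1.0, matvec(A, x), v)
        alpha = dot(r, r) / dot(matvec(A, r), r)
        xs.append(axpy(alpha, r, x))
    return xs


# Assumed sample data with lower bound m = 1 and upper bound M = 10.
A = [[1.0, 0.0],
     [0.0, 10.0]]
v = [1.0, 1.0]
u = [1.0, 0.1]                                    # exact solution of A x = v
m, M = 1.0, 10.0

xs = steepest_descent(A, v, [0.0, 0.0], steps=20)
Fs = [error_functional(A, u, x) for x in xs]
ratios = [Fs[n + 1] / Fs[n] for n in range(len(Fs) - 1)]
bound = ((M - m) / (M + m)) ** 2
print(max(ratios), bound)
```

The starting point in this example was chosen so that the reduction factor is attained at every step, which shows that the estimate cannot be improved in general.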
NEWTON’S METHOD


In Secs. 14.14 and 14.15 we discuss Newton’s method for solving a
linear or nonlinear equation $F(x) = 0$ when $F$ is an operator from a
Banach space $\mathfrak{B}_1$ to a Banach space $\mathfrak{B}_2$. The principal theorem on the
convergence of the method for this general case parallels the well-known
classical result for scalar equations. Our presentation does not include
the recent generalizations of Altman [1-7], Schröder [19], and Collatz
[9], for which we refer to the original papers cited.


14.14 The General Theorem 


Let $\mathfrak{B}_1$, $\mathfrak{B}_2$ be Banach spaces and let $F$ be an operator with domain
$D \subset \mathfrak{B}_1$ and range in $\mathfrak{B}_2$. If $F$ is F-differentiable on a neighborhood
(in $D$) of a point $x_0 \in D$, we denote by $f_0$ the bounded linear operator
$F'(x_0)$ and by $f_0^{-1}$ its inverse $[F'(x_0)]^{-1}$, if it exists.

Theorem 14.12. Let $x_0 \in D$ be selected such that the following conditions
hold:

(i) There exists a sphere $S(x_0,r_0)$ on which $F$ is twice F-differentiable and
$\|F''(x)\| \le K$.
(ii) The bounded linear operator $f_0 = F'(x_0)$ maps $S(x_0,r_0)$ onto $\mathfrak{B}_2$ and has
a bounded inverse $f_0^{-1}$ on $\mathfrak{B}_2$ such that $\|f_0^{-1}\| \le \beta_0$, $\|f_0^{-1}[F(x_0)]\| \le \eta_0$.
(iii) The constant $h_0 = \beta_0\eta_0K$ satisfies $h_0 \le \tfrac{1}{2}$ and

$$(1/h_0)(1 - \sqrt{1 - 2h_0})\,\eta_0 \le r_0.$$

Then the sequence $\{x_n\}$ defined by

$$x_{n+1} = x_n - f_n^{-1}[F(x_n)], \quad n \ge 0,$$

exists for all $n$ and converges to a solution $u \in S(x_0,r_0)$ of $F(x) = 0$.
The rate of convergence is given by

$$\|u - x_n\| \le 2^{-n+1} (2h_0)^{2^n - 1}\, \eta_0.$$

Moreover, $u$ is unique in every closed sphere $S(x_0,r) \subset S(x_0,r_0)$ with
$r < (1/h_0)(1 + \sqrt{1 - 2h_0})\,\eta_0$.

Proof. Clearly, $x_1$ is defined and $\|x_1 - x_0\| \le \eta_0 < r_0$; hence there is
a sphere $S(x_1,r_1) \subset S(x_0,r_0)$ on which (i) holds. By (ii) and the
mean-value theorem (see p. 490), $\|f_0^{-1}(f_0 - f_1)\| \le h_0 < 1$, so that the linear
bounded operator $g(x) = x - f_0^{-1}[f_0(x) - f_1(x)]$ has an inverse $g^{-1}$
with $\|g^{-1}\| \le (1 - h_0)^{-1}$ (see p. 489).

Obviously, $f_0[g(x)] = f_1(x)$ and $(f_0 g)^{-1} = g^{-1}f_0^{-1}$, so that $f_1^{-1}$ exists
and is bounded. Hence (ii) is satisfied for $x_1$ with $\beta_1 = \beta_0/(1 - h_0)$.
To find a bound for $\|f_1^{-1}[F(x_1)]\|$, we consider $G_0(x) = x - f_0^{-1}[F(x)]$,
for which $G_0(x_0) = x_1$ and $G_0'(x_0)h = 0$ for $h \in \mathfrak{B}_1$. Since $f_0^{-1}[F(x_1)] =
G_0(x_0) - G_0(x_1) + G_0'(x_0)(x_1 - x_0)$, we obtain $\|f_0^{-1}[F(x_1)]\| \le \tfrac{1}{2}h_0\eta_0$,
whence $\|f_1^{-1}[F(x_1)]\| \le \|g^{-1}\|\,\|f_0^{-1}[F(x_1)]\| \le h_0\eta_0/[2(1 - h_0)]$. Thus,
we can take $\eta_1 = h_0\eta_0/[2(1 - h_0)]$. Now we easily verify (iii) with
$h_1 = \beta_1\eta_1K$ and $r_1 = (1/h_1)(1 - \sqrt{1 - 2h_1})\,\eta_1 + \epsilon$, where $\epsilon > 0$ is such
that $S(x_0,r_0) \supset S(x_1,r_1) \supset S(x_1, r_1 - \epsilon)$.

It follows by induction that the sequence $\{x_n\}$ is well defined and that
each $x_n$ satisfies conditions (i) to (iii) with constants $\beta_n$, $\eta_n$, $h_n$, which are
related by the equations

$$\beta_n = \frac{\beta_{n-1}}{1 - h_{n-1}}, \quad \eta_n = \frac{h_{n-1}\eta_{n-1}}{2(1 - h_{n-1})}, \quad h_n = \beta_n\eta_nK = \frac{h_{n-1}^2}{2(1 - h_{n-1})^2}.$$

Hence, a simple reduction shows that

$$h_n \le \tfrac{1}{2}(2h_0)^{2^n}, \quad \eta_n \le 2^{-n}(2h_0)^{2^n - 1}\,\eta_0.$$

Since $\|x_{n+1} - x_n\| \le \eta_n$, it follows that $\|x_{n+p} - x_n\| \le 2\eta_n$, and so $\{x_n\}$
converges to a point $u \in \mathfrak{B}_1$ for which $\|u - x_n\| \le 2\eta_n$. This implies
$\|u - x_0\| \le (1/h_0)(1 - \sqrt{1 - 2h_0})\,\eta_0$; that is, $u \in S(x_0,r_0)$ by the last
inequality in (iii). Indeed, $u$ is a solution of $F(x) = 0$ because

$$F'(x_n)(x_{n+1} - x_n) + F(x_n) = 0$$

for each $n$ and hence $\|F(x_n)\| \le [\|F'(x_0)\| + r_0K]\,\|x_{n+1} - x_n\|$.

The uniqueness of $u$ in every sphere $S(x_0,r) \subset S(x_0,r_0)$ with $r <
(1/h_0)(1 + \sqrt{1 - 2h_0})\,\eta_0$ is proved by contradiction. If $\bar{u} \ne u$ were
another solution in $S(x_0,r)$, then $\|\bar{u} - x_0\| = (\theta/h_0)(1 + \sqrt{1 - 2h_0})\,\eta_0$,
where $0 < \theta < 1$, and $G_0(\bar{u}) = \bar{u}$. Similarly as before, we obtain
$\|\bar{u} - x_1\| \le (\theta^2/h_1)(1 + \sqrt{1 - 2h_1})\,\eta_1$ and, by induction,

$$\|\bar{u} - x_n\| \le \frac{\theta^{2^n}}{h_n}(1 + \sqrt{1 - 2h_n})\,\eta_n \le \frac{2\theta^{2^n}}{\beta_nK} \le \frac{2\theta^{2^n}}{\beta_0K}.$$

Then $\|\bar{u} - x_n\| \to 0$ as $n \to \infty$, which is absurd, since $x_n \to u$ as $n \to \infty$.
This completes the proof (see also [11-15, 16]).

There are several remarks that must be made. 

Remark 1. The estimate $\|f_0^{-1}[F(x_0)]\| \le \eta_0$ in (ii) may be written in the
form $\|x_1 - x_0\| \le \eta_0$, which is frequently more convenient to use. Also,
it can be replaced by the estimate $\|F(x_0)\| \le \bar\eta_0$, whence
$\|f_0^{-1}[F(x_0)]\| \le \beta_0\bar\eta_0 = \eta_0$.

Remark 2. Putting $\rho_{1,2} = (1/h_0)(1 \mp \sqrt{1 - 2h_0})\eta_0$, observe that
$S(x_0,\rho_1) \subset S(x_0,2\eta_0) \subset S(x_0,\rho_2)$. Hence the second condition in (iii)
may be replaced by the weaker requirement $S(x_0,r_0) \supset S(x_0,2\eta_0)$. The
sequence $\{x_n\}$ then converges to a solution $u$ of $F(x) = 0$, which is unique
in $S(x_0,2\eta_0)$.

Remark 3. Condition (iii) and the radius $\rho_2$ for the sphere $S(x_0,\rho_2)$ in
which uniqueness holds are best possible, as the following example shows.
With $B_1 = B_2 = R$ and $F(x) = \tfrac12x^2 - x + h$, we find for $x_0 = 0$ that
$\beta_0 = K = 1$, $\eta_0 = h_0 = h$. Clearly, $u = 1 \pm \sqrt{1 - 2h}$ exists only if
$h_0 = h \le \tfrac12$; the smaller root lies on the boundary of $S(x_0,\rho_1)$, the larger
on the boundary of $S(x_0,\rho_2)$.
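The example of Remark 3 is easy to try out. The sketch below runs Newton's process on $F(x) = \tfrac12x^2 - x + h$ from $x_0 = 0$ and compares the limit with the two radii $1 \mp \sqrt{1 - 2h}$; the value $h = 0.375$ and the helper name `newton` are illustrative assumptions.

```python
import math


def newton(F, dF, x0, steps=60, tol=1e-14):
    """Plain Newton iteration x <- x - F(x)/F'(x)."""
    x = x0
    for _ in range(steps):
        step = F(x) / dF(x)
        x -= step
        if abs(step) < tol:
            break
    return x


h = 0.375                                  # any 0 < h <= 1/2 would do
F = lambda x: 0.5 * x * x - x + h
dF = lambda x: x - 1.0

u = newton(F, dF, 0.0)
rho1 = 1 - math.sqrt(1 - 2 * h)            # radius reached by the smaller root
rho2 = 1 + math.sqrt(1 - 2 * h)            # radius reached by the larger root
print(u, rho1, rho2)
```

Starting from $x_0 = 0$ the iterates increase to the smaller root, which lies at distance exactly $\rho_1$ from $x_0$; the larger root lies at distance $\rho_2$.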

Remark 4. At each step we are required to find $f_n^{-1}[F(x_n)]$ or, equivalently,
to solve $F'(x_n)(x_n - x_{n+1}) = F(x_n)$, which may present considerable
difficulties. Then the following theorem is of importance for
applications.


Theorem 14.13. Let conditions (i), (ii) of Theorem 14.12 be satisfied, and
suppose (iii) that $h_0 = \beta_0\eta_0K < \tfrac12$ and $(1/h_0)(1 - \sqrt{1 - 2h_0})\eta_0 \le r_0$.
Then the sequence $\{x_n\}$ defined by

$$x_{n+1} = x_n - f_0^{-1}[F(x_n)], \qquad n \ge 0,$$

exists for all $n$ and converges to a solution $u \in S(x_0,r_0)$ of $F(x) = 0$, where


| 2 a Xn ll < 
n 


w=1— V1 —2hy (<1). Moreover, the rate of convergence is given by 
|| u — x, || < pet |x, os X|l. 
The proof is entirely analogous to that of Theorem 14.12.
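A small experiment illustrates the modified process, in which the derivative is frozen at $x_0$. The sketch below uses the scalar example $F(x) = \tfrac12x^2 - x + h$ with $x_0 = 0$ (so $F'(x_0) = -1$) and records the errors; the function name and the choice $h = 0.375$ are assumptions made for illustration. For this example the error is reduced at each step by at least the factor $1 - \sqrt{1 - 2h_0}$.

```python
import math


def modified_newton(F, dF_x0, x0, steps):
    """Iterate x_{n+1} = x_n - F(x_n)/F'(x_0), derivative frozen at x_0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - F(xs[-1]) / dF_x0)
    return xs


h = 0.375
F = lambda x: 0.5 * x * x - x + h
xs = modified_newton(F, dF_x0=-1.0, x0=0.0, steps=40)

u = 1 - math.sqrt(1 - 2 * h)       # the solution reached from x0 = 0
mu = 1 - math.sqrt(1 - 2 * h)      # contraction factor 1 - sqrt(1 - 2 h0)
errors = [abs(u - x) for x in xs]
print(errors[:5], errors[-1])
```

The convergence is only linear, in contrast with the quadratic convergence of Newton's process proper, but each step needs only a value of $F$.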


14.15 Solution of Scalar Equations 


We apply first Theorem 14.12 to the case of a single scalar equation
$F(x) = 0$ where $F$ is defined and real-valued on some bounded interval
$I_0 = (x_0 - r_0,\ x_0 + r_0)$.

Theorem 14.14. Let $x_0\ (\in I_0)$ be selected such that the following conditions
hold:

(i) $F$ has a continuous second derivative $F''(x)$ on $I_0$ such that $|F''(x)| \le K$.
(ii) $F'(x_0) \ne 0$.
(iii) $h_0 = K\,|F(x_0)|\,[|F'(x_0)|]^{-2}$ satisfies $h_0 \le \tfrac12$ and
$(1/h_0)(1 - \sqrt{1 - 2h_0})\,|F(x_0)|\,[|F'(x_0)|]^{-1} \le r_0$.
Then the sequence $\{x_n\}$ defined by

$$x_{n+1} = x_n - F(x_n)\,[F'(x_n)]^{-1}, \qquad n \ge 0,$$

converges to the unique solution $u$ of $F(x) = 0$ for which

$$|u - x_0| \le \frac{1}{K}\,(1 - \sqrt{1 - 2h_0})\,|F'(x_0)|.$$
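The hypotheses of the scalar theorem are simple to test in practice. The following sketch, under the assumption that a bound $K$ on $|F''|$ is supplied by the user, computes $h_0 = K|F(x_0)|/|F'(x_0)|^2$, checks $h_0 \le \tfrac12$ and the radius condition, and returns the error bound $(1/K)(1 - \sqrt{1 - 2h_0})|F'(x_0)|$. The function names and the test equation $x^2 - 2 = 0$ are illustrative only.

```python
import math


def kantorovich_scalar(F, dF, K, x0, r0):
    """Check (ii) and (iii) at x0; return (h0, error bound) or None if they fail."""
    F0, dF0 = F(x0), dF(x0)
    if dF0 == 0:
        return None
    h0 = K * abs(F0) / dF0 ** 2
    if h0 > 0.5:
        return None
    bound = (1 - math.sqrt(1 - 2 * h0)) * abs(dF0) / K
    return (h0, bound) if bound <= r0 else None


def newton(F, dF, x0, steps=20):
    """Plain Newton iteration with a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x -= F(x) / dF(x)
    return x


F = lambda x: x * x - 2.0
dF = lambda x: 2.0 * x
check = kantorovich_scalar(F, dF, K=2.0, x0=1.5, r0=0.5)   # |F''| = 2 everywhere
u = newton(F, dF, 1.5)
print(check, u, abs(u - 1.5))
```

For a quadratic $F$ the bound is attained: here $|u - x_0| = 1.5 - \sqrt2$ coincides with the computed bound up to rounding.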


Let us now consider a system of $m$ equations,

$$f_k(\xi_1, \xi_2, \ldots, \xi_m) = 0, \qquad k = 1, 2, \ldots, m,$$

defined on some domain of $R^m$. We denote by $J(x)$ the Jacobian matrix
$(\partial f_k/\partial\xi_j)$ at the point $x = (\xi_1,\ldots,\xi_m)$, if it exists. In terms of the $l_\infty$
norm on $R^m$, Theorem 14.12 can be stated as follows.

Theorem 14.15. Let $x_0 = (\xi_1^0,\ldots,\xi_m^0)$ be selected such that the
following conditions hold:

(i) The functions $f_k$ have continuous second-order partial derivatives on
$Q(x_0,r_0) = \{x \in R^m : |\xi_k - \xi_k^0| \le r_0,\ 1 \le k \le m\}$ such that
$|\partial^2f_k/\partial\xi_i\,\partial\xi_j| \le K$, $1 \le i, j, k \le m$, on $Q(x_0,r_0)$.
(ii) Let $\det J(x_0) \ne 0$ and $|f_k(\xi_1^0,\ldots,\xi_m^0)| \le \eta_0$ for $1 \le k \le m$, and

$$\max_{1 \le k \le m} \sum_{j=1}^{m} \left|\left(\frac{\partial\xi_k}{\partial f_j}\right)_{x=x_0}\right| \le \beta_0.\dagger$$

(iii) The constant $h_0 = \beta_0^2\eta_0Km^2$ satisfies $h_0 \le \tfrac12$ and
$(1/h_0)(1 - \sqrt{1 - 2h_0})\beta_0\eta_0 \le r_0$.
Then the sequence $x_n$ of vectors $x_n = (\xi_1^n,\ldots,\xi_m^n)$ defined by

$$x_{n+1} = x_n - J^{-1}(x_n)\,f(x_n),$$

where $f(x_n) = [f_1(\xi_1^n,\ldots,\xi_m^n),\ldots,f_m(\xi_1^n,\ldots,\xi_m^n)]$, converges to the
unique solution $u = (u_1,\ldots,u_m)$ of $f(x) = 0$, for which

$$|u_k - \xi_k^0| \le \frac{1}{h_0}\,(1 - \sqrt{1 - 2h_0})\,\beta_0\eta_0, \qquad k = 1, 2, \ldots, m.$$

$\dagger$ The inverse of $J(x_0)$ is denoted by $J^{-1}(x_0) = (\partial\xi_k/\partial f_j)_{x=x_0}$.


Using the euclidean norm $\|x\|^2 = \sum_{k=1}^{m} (\xi_k)^2$ on $R^m$, we obtain the
following formulation.

Theorem 14.15'. Let $x_0 = (\xi_1^0,\ldots,\xi_m^0)$ be selected such that the
following conditions hold:

(i) The functions $f_k$ have continuous second-order partial derivatives on
$S(x_0,r_0) = \{x \in R^m : \|x - x_0\| \le r_0\}$ such that
$\sum_{i,j,k} (\partial^2f_k/\partial\xi_i\,\partial\xi_j)^2 \le K^2$ on $S(x_0,r_0)$.
(ii) $\det[J(x_0)] \ne 0$, $\|f(x_0)\| \le \eta_0$, and $\sum_{j,k} [(\partial\xi_k/\partial f_j)_{x=x_0}]^2 \le \beta_0^2$.
(iii) The constant $h_0 = \beta_0\eta_0K$ satisfies $h_0 \le \tfrac12$ and
$(1/h_0)(1 - \sqrt{1 - 2h_0})\eta_0 \le r_0$.
Then the sequence $\{x_n\}$ of vectors $x_n = (\xi_1^n,\ldots,\xi_m^n)$ defined by

$$x_{n+1} = x_n - J^{-1}(x_n)\,f(x_n),$$

where $f(x_n) = [f_1(\xi_1^n,\ldots,\xi_m^n),\ldots,f_m(\xi_1^n,\ldots,\xi_m^n)]$, converges to the
unique solution $u = (u_1,\ldots,u_m)$ of $f(x) = 0$, for which

$$\|u - x_0\| \le \frac{1}{h_0}\,(1 - \sqrt{1 - 2h_0})\,\eta_0.$$

There is an interesting example due to Kantorovich [11] for which 
Theorem 14.15’ guarantees the convergence of Newton’s process and the 
conditions of Theorem 14.15 are not satisfied. For examples and appli- 
cations to integral equations, we refer to [11]. 

In a series of papers [2-7], Altman extended Newton’s method to the 
case in which F’(x,) has no inverse; he also showed [7] that the Newton 
iteration for F(x) = ||Ax — v|| is identical with the iteration by steepest 
descent for solving the linear equation Ax = v. This yields a generaliza- 
tion of the method of steepest descent to nonlinear equations. Other 
extensions are due to Schroder [19] and Collatz [9].


15.1 Some Properties of Block Designs


The theorems on block designs which are stated in this section provide 
sufficient background for the two following sections. Proofs are not 
given, but references are provided at the end of the section for the reader 
who wishes to know the proofs. 

An incomplete balanced block design (or more briefly, a block design) 
is an arrangement of objects into sets (or blocks) in the following way: 

1. There are b blocks, each containing k distinct objects.

2. There are v objects, and each object occurs in r distinct blocks.

3. Each unordered pair of objects $a_i$, $a_j$ occurs together in exactly $\lambda$ of
the b blocks.

There are two obvious relations on the five parameters:

$$bk = vr, \qquad r(k - 1) = \lambda(v - 1). \tag{15.1}$$
The first of these counts the total number of occurrences of objects in
blocks, there being, on the one hand, b blocks each containing k objects
and, on the other hand, v objects each occurring in r blocks. For the
second relation we count the number of pairs containing a specified object,
say $a_1$. On the one hand, $a_1$ occurs in r blocks, and in each of these
is paired with the remaining (k - 1) objects. On the other hand, $a_1$ is
to be paired $\lambda$ times with each of the remaining (v - 1) objects.
We say that a block design is symmetric if b = v, k = r, in which case
the second relation takes the form

$$k(k - 1) = \lambda(v - 1). \tag{15.2}$$

With a block design D we may associate its incidence matrix A defined in
the following way. The objects are numbered from 1 to v, $a_i$, $i = 1,\ldots,v$,
and the blocks numbered from 1 to b, $B_j$, $j = 1,\ldots,b$. An incidence
number $a_{ij}$ is defined by

$$a_{ij} = +1 \ \text{if } a_i \in B_j, \qquad a_{ij} = 0 \ \text{if } a_i \notin B_j. \tag{15.3}$$

Then the incidence matrix A is the v x b matrix of the incidence numbers

$$A = (a_{ij}), \qquad i = 1,\ldots,v, \quad j = 1,\ldots,b. \tag{15.4}$$


The following relation on the incidence matrix is fundamental:

$$AA^T = \begin{pmatrix} r & \lambda & \cdots & \lambda \\ \lambda & r & \cdots & \lambda \\ \vdots & & \ddots & \vdots \\ \lambda & \lambda & \cdots & r \end{pmatrix} = B, \tag{15.5}$$

where B is the v x v matrix with r down the main diagonal and $\lambda$ off the
main diagonal. This is easily seen for, with $AA^T = B$, an element $b_{ij}$
of B is the inner product of the ith and jth rows of A. Thus $b_{ii}$ is the
inner product of the ith row with itself, and the ith row consists of 0s and
1s, the 1s giving the occurrences of object $a_i$ in blocks. But since $a_i$ is in
r blocks, the number of 1s in the row is r. This gives $b_{ii} = r$. For $b_{ij}$,
$j \ne i$, in calculating the inner product of the ith and jth rows of A, we
have a sum of b products, each of which is zero, except for those t's for
which $a_{it} = +1$ and $a_{jt} = +1$, in other words, those t's for which both
objects, $a_i$ and $a_j$, are in block $B_t$. But this happens exactly $\lambda$ times, and
so $b_{ij} = \lambda$.
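These relations can be confirmed on a small case. The sketch below builds the incidence matrix of the cyclic design on 13 objects whose blocks are the translates of 0, 1, 3, 9 (mod 13), and checks the two parameter relations, the form of $AA^T$, and the determinant $(r - \lambda)^{v-1}[(v - 1)\lambda + r]$, which holds for any matrix with r on the diagonal and $\lambda$ elsewhere. The helper `det` (exact elimination over the rationals) and all variable names are our own illustrative choices.

```python
from fractions import Fraction

v, k, lam = 13, 4, 1
base = [0, 1, 3, 9]                                   # base block, objects are residues mod 13
blocks = [sorted((x + i) % v for x in base) for i in range(v)]
b, r = len(blocks), k                                 # symmetric design: b = v, r = k

# Incidence matrix A (v x b): A[i][j] = 1 if object i lies in block j.
A = [[1 if i in blocks[j] else 0 for j in range(b)] for i in range(v)]

# B = A A^T, computed row by row as inner products of rows of A.
B = [[sum(A[i][t] * A[j][t] for t in range(b)) for j in range(v)] for i in range(v)]
expected_B = [[r if i == j else lam for j in range(v)] for i in range(v)]


def det(M):
    """Exact determinant by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    n, d = len(M), Fraction(1)
    for c in range(n):
        p = next((i for i in range(c, n) if M[i][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d
        d *= M[c][c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n):
                M[i][j] -= f * M[c][j]
    return d


expected_det = (r - lam) ** (v - 1) * ((v - 1) * lam + r)
print(b * k == v * r, r * (k - 1) == lam * (v - 1), B == expected_B, det(B) == expected_det)
```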
It is easy to evaluate the determinant of B, and this is

$$\det B = (r - \lambda)^{v-1}[(v - 1)\lambda + r]. \tag{15.6}$$

If we had r = $\lambda$, then every object would be paired with every other
object in every occurrence, and this can happen only in the trivial case
in which k = v and every block contains all the objects. Hence r > $\lambda$,
and so B is nonsingular, whence, since A is a v x b matrix, from considerations
of rank it follows that

$$b \ge v. \tag{15.6'}$$

For the incidence matrix A of a symmetric design, more than (15.5) is
true. We have, in fact,

$$AA^T = A^TA = (k - \lambda)I + \lambda S = B, \qquad AS = SA = kS, \tag{15.7}$$

where S is the v x v matrix consisting entirely of 1s. The relation
$A^TA = B$ for a symmetric design means that the dual of a symmetric
design, obtained by interchanging the roles of object and block, is also a
design with the same parameters as the original (but not in general isomorphic
to the original). In particular, for a symmetric design any
two blocks have $\lambda$ objects in common. Hence, from a symmetric design
we may, by deleting a block B and the objects in it, obtain a smaller,
nonsymmetric design, called the residual design. Also, the objects in B as
they appear in the remaining blocks will themselves form a design, called
the derived design. Thus in the following design, with v = b = 25,
r = k = 9, $\lambda$ = 3, if we delete the block 17,..., 25 and its objects elsewhere,
we obtain the residual design with objects 1,..., 16 with parameters
v = 16, b = 24, r = 9, k = 6, $\lambda$ = 3. The objects 17,..., 25
in the first 24 blocks form the derived design with v = 9, b = 24, r = 8,
k = 3, $\lambda$ = 2.


v = b = 25,   r = k = 9,   λ = 3

    1  2  5  6 11 12 17 20 23        2  4  5  7 14 16 18 20 25
    1  2  9 10 15 16 17 21 25        5  6  9 10 13 14 17 18 19
    1  2  7  8 13 14 17 22 24        5  7  9 11 13 15 20 21 22
    3  4  7  8  9 10 17 20 23        5  8  9 12 13 16 23 24 25
    3  4 11 12 13 14 17 21 25        7  8 11 12 15 16 17 18 19
    1  3  5  7 10 12 18 21 24        6  8 10 12 14 16 20 21 22
    1  3  9 11 14 16 18 22 23        6  7 10 11 14 15 23 24 25
    1  3  6  8 13 15 18 20 25
    2  4  6  8  9 11 18 21 24                                   (15.8)
    2  4 10 12 13 15 18 22 23
    3  4  5  6 15 16 17 22 24
    1  4  5  8 10 11 19 22 25
    1  4  9 12 14 15 19 20 24
    1  4  6  7 13 16 19 21 23
    2  3  6  7  9 12 19 22 25
    2  3 10 11 13 16 19 21 24
    2  3  5  8 14 15 19 21 23

                    17 18 19 20 21 22 23 24 25


For a fuller account of this part of the theory, see Mann [4]. But given
a design with the parameters of a residual design, it is not always possible
to adjoin a block and its objects to obtain a symmetric design. The
following design is an example due to Bhattacharya [5]:


v = 16,   b = 24,   r = 9,   k = 6,   λ = 3

    1  2  7  8 14 15        1  4  7  8 11 16
    3  5  7  8 11 13        2  4  8 10 12 14
    2  3  8  9 13 16        5  6  8 10 15 16
    3  5  8  9 12 14        1  6  8 10 12 13
    1  6  7  9 12 13        1  2  3 11 12 15
    2  5  7 10 13 15        2  6  7  9 14 16          (15.9)
    3  4  7 10 12 16        1  4  5 13 14 16
    3  4  6 13 14 15        2  5  6 11 12 16
    4  5  7  9 12 15        1  3  9 10 15 16
    2  4  9 10 11 13        4  6  8  9 11 15
    3  6  7 10 11 14        1  5  9 10 11 14
    1  2  3  4  5  6       11 12 13 14 15 16


In this example it is clearly impossible to adjoin a further block and nine
further objects in such a way as to obtain a symmetric design with
v = b = 25, r = k = 9, λ = 3, since the two blocks 1 6 7 9 12 13 and
1 6 8 10 12 13 (underlined in the original) contain the four objects
1, 6, 12, 13 in common, and this is impossible in a symmetric design with λ = 3.

There are many things we should like to know about block designs.
Fundamentally, we should like to know for which parameters v, b, r, k, λ
satisfying (15.1) a design exists, how many designs exist if there are any,
and what can be said about the structure or automorphisms of such
systems. A symmetric design with λ = 1 is a finite projective plane,
and writing k - λ = n, we have the parameters $v = b = n^2 + n + 1$,
r = k = n + 1, λ = 1. Finite planes are known to exist whenever n is
a prime or prime power, but no finite plane has as yet been found with n
an integer, not a prime or prime power.

Chowla and Ryser [1] have shown that a necessary condition for the
existence of a symmetric design is that (1) if v is even, then n = k - λ
must be a square, or (2) if v is odd, the diophantine equation

$$x^2 = (k - \lambda)y^2 + (-1)^{(v-1)/2}\lambda z^2 \tag{15.10}$$

must have a nonzero solution in integers. This and some similar nonexistence
theorems are based on the Hasse-Minkowski theory of the
rational equivalence of quadratic forms.

A further type of theoretical approach is due to Connor [2]. Equation
(15.5) for the incidence matrix is equivalent to the following representation
of quadratic forms:

$$L_1^2 + \cdots + L_b^2 = r(x_1^2 + \cdots + x_v^2) + 2\lambda \sum_{i<j} x_ix_j, \tag{15.11}$$

where

$$L_j = \sum_{i=1}^{v} a_{ij}x_i, \qquad j = 1,\ldots,b,$$

and the $a_{ij}$ are the elements of the incidence matrix. If we specify some
of the blocks, say $B_1,\ldots,B_t$, then, writing Q for the right-hand side of
(15.11), we have

$$Q^* = Q - L_1^2 - \cdots - L_t^2 = L_{t+1}^2 + \cdots + L_b^2. \tag{15.12}$$

Hence $Q^*$ must be a semidefinite form if it is to be possible to complete
the design by finding $L_{t+1},\ldots,L_b$. Connor has found a test for $Q^*$ of
the following type. Let $S_{ju}$ be the number of elements common to
blocks $B_j$ and $B_u$, $j, u = 1,\ldots,t$; then put $c_{jj} = (r - k)(r - \lambda)$,
$c_{ju} = \lambda k - rS_{ju}$ $(j \ne u)$. Then a necessary condition that it be possible to
complete $B_1,\ldots,B_t$ to a full design is that

$$\det |c_{ju}| \ge 0. \tag{15.13}$$

This relation is a consequence of the fact that $Q^*$ must be semidefinite.
Connor gives somewhat more than (15.13), but this is the main conclusion.

It may happen that a block design possesses an automorphism. Thus 
the following design, 


v = b = 13,   r = k = 4,   λ = 1

     0   1   3   9
     1   2   4  10
     2   3   5  11
     3   4   6  12
     4   5   7   0
     5   6   8   1
     6   7   9   2          (15.14)
     7   8  10   3
     8   9  11   4
     9  10  12   5
    10  11   0   6
    11  12   1   7
    12   0   2   8,

clearly has the automorphism of order 13, $x \to x + 1$, where the objects
are regarded as residues modulo 13. In general, we may have a symmetric
design whose objects may be regarded as the residues modulo v
where $x \to x + 1$ is an automorphism cyclic of order v on both the objects
and the blocks. Such a design is completely determined by a single
block, and, say, above 2, 3, 5, 11 (mod 13) determines the entire design.
The property of k residues modulo v, $a_1,\ldots,a_k$ (mod v) determining
such a design is the following: every residue $d \not\equiv 0$ (mod v) has exactly λ
representations in the form

$$d \equiv a_i - a_j \pmod v, \tag{15.15}$$

where $a_i$ and $a_j$ are in the set $a_1,\ldots,a_k$ (mod v), called a difference set of k
residues modulo v.

We say that a residue t prime to v is a multiplier of the difference set if
$x \to tx$ (mod v) is an automorphism of the design; or, what is the same
thing, for an appropriate s, $ta_1, ta_2,\ldots,ta_k$ (mod v) are $a_1 + s,\ldots,a_k + s$
(mod v) in some order.
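Both definitions translate directly into a short test. The sketch below checks the difference-set property for the residues 0, 1, 3, 9 (mod 13) and tests whether a residue t is a multiplier by comparing the image set with every translate; the function names are illustrative assumptions.

```python
from collections import Counter


def difference_counts(D, v):
    """Count how often each residue arises as a difference of two distinct elements of D."""
    return Counter((a - b) % v for a in D for b in D if a != b)


def is_difference_set(D, v, lam):
    """True if every nonzero residue mod v has exactly lam representations a - b."""
    counts = difference_counts(D, v)
    return all(counts[d] == lam for d in range(1, v))


def is_multiplier(t, D, v):
    """True if t*D is a translate D + s of D for some s (mod v)."""
    image = {(t * a) % v for a in D}
    return any(image == {(a + s) % v for a in D} for s in range(v))


D, v, lam = [0, 1, 3, 9], 13, 1
print(is_difference_set(D, v, lam), is_multiplier(3, D, v), is_multiplier(2, D, v))
```

Here 3 fixes the set itself (s = 0), while 2 carries it to 0, 2, 5, 6, which is not a translate of the original set.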

The multipliers of a difference set form a multiplicative group modulo
v. Every known difference set possesses some multiplier different from
the identity modulo v, and there is some evidence that this may be universally
true. A theorem on the existence of multipliers is the following,
whose proof may be found in [3].

Theorem. If $a_1,\ldots,a_k$ (mod v) are a difference set and if p is a prime
dividing k - λ such that $p \nmid v$ and p > λ, then p is a multiplier of the difference
set.


15.2 Systematic Search for Block Designs


In a projective plane of order 8 there are 8 + 1 = 9 points on every
line and 9 lines through every point. There are, in all, 73 points and 73
lines. This is the symmetric block design with v = b = 73, k = r = 9,
and λ = 1.

Let A, B, C be the vertices of a triangle in a plane of order n. Call AB
the line at infinity $L_\infty$, AC the line x = 0, and BC the line y = 0. Label
the (n - 1) remaining lines through A as x = 1, x = 2,..., x = n - 1
in any order and label the (n - 1) remaining lines through B as y = 1,
..., y = n - 1 in any order. A point P not on $L_\infty$ will then lie on a
unique line x = a and a unique line y = b. Then assign to P the coordinates
(a,b).
The (n - 1) lines through C = (0,0), apart from AC and BC, will
intersect each of x = 1,..., x = n - 1 once and each of y = 1,...,
y = n - 1 once. Such a line L will intersect $L_\infty$ in some infinite point
and will also contain (0,0) and (n - 1) points (i,j), where i and j take
values 1 to n - 1. With L, associate the permutation

$$\begin{pmatrix} 0 & 1 & 2 & \cdots & n-1 \\ 0 & a_1 & a_2 & \cdots & a_{n-1} \end{pmatrix}$$

if (0,0), $(1,a_1),\ldots,(n - 1, a_{n-1})$ are the finite points of L. The second
rows of the (n - 1) L's give the array

$$\begin{matrix} 0 & a_{11} & a_{12} & \cdots & a_{1,n-1} \\ 0 & a_{21} & a_{22} & \cdots & a_{2,n-1} \\ \vdots & & & & \vdots \\ 0 & a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1} \end{matrix} \tag{15.16}$$

and this (deleting the 0s) will be a Latin square of order n - 1; that is,
every digit 1,..., n - 1 occurs exactly once in each row and exactly
once in each column.
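The construction is easily tried in the simplest setting. The sketch below assumes a desarguesian plane of prime order p coordinatized by the integers modulo p (the case n = 8 of the text would need the field of eight elements instead), so that the lines through (0,0) other than the axes are y = mx (mod p), m = 1,..., p - 1. It forms the array of second rows and checks that deleting the 0s leaves a Latin square. All names are illustrative.

```python
def origin_lines_array(p):
    """Rows are the lines y = m*x (mod p), m = 1..p-1, listed for x = 0..p-1."""
    return [[(m * x) % p for x in range(p)] for m in range(1, p)]


def is_latin_square(sq):
    """Every symbol occurs exactly once in each row and in each column."""
    n = len(sq)
    symbols = set(sq[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in sq)
    cols_ok = all({sq[i][j] for i in range(n)} == symbols for j in range(n))
    return len(symbols) == n and rows_ok and cols_ok


array = origin_lines_array(7)
square = [row[1:] for row in array]        # delete the leading 0s
for row in array:
    print(*row)
print(is_latin_square(square))
```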

We may start a search for planes of order 8 by taking a list of 7 x 7
Latin squares. To within equivalence for geometric purposes, there are
147 such squares, of which 146 were listed by Norton [5]. Sade [6]
found an omission and verified that, with this square added, Norton's
list is complete. Indeed, it is necessary to investigate only 100 of the 147
squares, since these are listed in terms of the number of intercalates, an
intercalate being a subarray of the type

        . . . a . . . b . . .
                                        (15.17)
        . . . b . . . a . . .

and a simple argument shows that for an appropriate choice of A, B, C
in the plane we shall find a square with at most 12 intercalates. Thus
we need examine only squares 1 to 99 of Norton's list and the omission
found by Sade.

All 100 squares used in the search can be normalized, so that the first
two lines of (15.16) read

        0 1 2 3 4 5 6 7
        0 2 3 4 5 6 7 1.          (15.18)


Thus the 100 squares consist of these two and a further set of five lines.
An attempt to add further lines of a plane to each of these 100 starts was
made in the summer of 1955 on SWAC. We already have the first line
of (15.18) as a line through (1,1). There will be in addition the known
lines x = 1 and y = 1 and six further lines of the following form:

        X 1 3 X X X X X
        X 1 X 4 X X X X
        X 1 X X 5 X X X
        X 1 X X X 6 X X          (15.19)
        X 1 X X X X 7 X
        X 1 X X X X X X

Here the first five are the lines joining (1,1) to the points (2,3), (3,4), (4,5),
(5,6), and (6,7) of the second line of (15.18), and these are, of course,
distinct lines, since no line of (15.19) may intersect the second line of
(15.18) twice. The sixth line of (15.19) is, of course, the unique line
through (1,1) parallel to 0 2 3 4 5 6 7 1. If we succeed in finding the
lines (15.19), we may then add further lines through (2,3), these being of
the form

        X X 3 X 4 X X X
        X X 3 X X 5 X X
        X X 3 X X X 6 X          (15.20)
        X X 3 X X X X 7
        X X 3 X X X X X

The first four of these are the lines through (2,3) intersecting the first
line of (15.18) in (4,4), (5,5), (6,6), and (7,7), and the last is the parallel.
For only the first of the 100 squares was it possible to add all 11 lines of
(15.19) and (15.20), and in this case there were only four ways in which
this could be done. The rest of the work was easily done by hand, two
of the four answers being impossible to complete and the other two both
leading to the same plane, this being the known desarguesian plane of
order 8 and by this study shown to be unique. The criterion for acceptability
of a line is, of course, that it should not have as many as two
points in common with any previous line. Thus for a line

        0    1    2    3    4    5    6    7
        a_0  a_1  a_2  a_3  a_4  a_5  a_6  a_7

the a's must be 0,..., 7 in some order, and this sequence of eight numbers
may not agree in more than one position with any line already taken.
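The acceptability criterion amounts to counting positions of agreement. The following sketch states it for lines written as sequences of eight digits; the function names are illustrative, and the two lines of (15.18) serve as the lines already taken.

```python
def agreements(line_a, line_b):
    """Number of positions in which two lines carry the same digit."""
    return sum(1 for a, b in zip(line_a, line_b) if a == b)


def acceptable(candidate, existing_lines):
    """A candidate must be a permutation of 0..n-1 and share at most one
    point (one position of agreement) with every line already taken."""
    if sorted(candidate) != list(range(len(candidate))):
        return False
    return all(agreements(candidate, line) <= 1 for line in existing_lines)


existing = [[0, 1, 2, 3, 4, 5, 6, 7],
            [0, 2, 3, 4, 5, 6, 7, 1]]

print(acceptable([6, 1, 3, 2, 7, 0, 4, 5], existing))   # meets each line once
print(acceptable([0, 1, 3, 2, 5, 4, 7, 6], existing))   # meets the first line twice
```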
Let us illustrate the nature of the search with square 70:

[The array for square 70 is illegible in this copy.]


There are 21 ways of adding the lines (15.19) to square 70, numbered
70.1, 70.2, and so on.

[The 21 arrays are illegible in this copy.]

If we now endeavor to add the lines of (15.20) to 70.1, the first set of
lines (15.19), we find exactly two possibilities for lines of the form
X X 3 X 4 X X X, and for each of these one line of the form
X X 3 X X 5 X X.

[The two cases are illegible in this copy.]

In neither case is it possible to add a line of the form X X 3 X X X 6 X,
and thus 70.1 cannot be completed to a plane. The other starts go out
in much the same way.

It is worth noting that it was found by hand computation that the 
starts would go through the lines (15.19) in some quantity, the average 
being around 20 cases for each square. But it never appeared possible 
to add more than two or three more lines. A machine program for add- 
ing more than the 11 lines would have been much harder to write and 
harder to get onto SWAC, which has only 256 words of high-speed 
memory. And as matters turned out, the amount left to hand computa- 
tion was very small. Thus in this case, as in many others, the right 
proportion of hand and machine calculation provided by far the most 
satisfactory solution. 

The program contained two main subroutines, the first the calculation
of the lines (15.19), the second those of (15.20). Each of these was sufficiently
long so that it was necessary to store one on the drum while the
other was being used, but the number of completions of lines (15.19) was 
sufficiently small (as remarked above, about 20 for each run) so that the 
time consumed by transfers onto and off the drum was negligible. 

Each digit was added singly. The method is best explained by an
illustration. Suppose that, in adding the lines (15.19), we have the line

        6    1    3    2    7    0    4    5
        X_1  1    X_3  4    X_5  X_6  X_7  X_8

We add $X_3, X_5, X_6, X_7, X_8, X_1$ in this order, since $X_1$ is the least restricted
digit. Suppose that we have also taken $X_3 = 6$.

        6    1    3    2    7    0    4    5
        X_1  1    6    4    X_5  X_6  X_7  X_8

To find $X_5$, we construct a little table of digits which may be used as $X_5$.
Here $X_5 \ne 1, 6, 4$, the digits already used in the line. Also, $X_5 \ne 7$,
the digit above it, since we may not have two lines of the form
X 1 X X 7 X X X. We also look at a catalogue made from square 70
and indicating exclusions and find

        X 1 X X 4 X X X = 0 1 2 3 4 5 6 7

        X X 6 X 3 X X X = 0 7 6 5 3 4 1 2

        X X X 4 5 X X X = 0 2 3 4 5 6 7 1

Hence we have $X_5 \ne 4, 3, 5$. The combined exclusions are $X_5 \ne 1, 3,
4, 5, 6, 7$. Thus $X_5$ may be 0 or 2, and we represent this information
positionally using 8 bits of a word, these being in this case

        1 0 1 0 0 0 0 0.

The 1s are in positions 0 and 2, which indicate that these digits are
permissible. The 0s indicate that the remaining digits may not be used.
These bits are computed by Boolean operations, since it will be noted
that 4 has been excluded twice as a possibility for $X_5$. A shift to the left,
with testing for overflow, indicates which is the smallest available digit.
This is then used (here the zero) and the value then discarded, but remaining
values are retained to be used on a backtrack. Going forward,
we recompute the exclusions. Thus, if another value is used for $X_3$, we
recompute the exclusions for $X_5$ even though some of these remain valid,
for example, $X_5 \ne 4, 5$.
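The positional bookkeeping can be imitated in a few lines. The sketch below is not the SWAC code, only an illustration under our own conventions: the leftmost of 8 bits stands for digit 0, exclusions are combined by Boolean OR (so that a digit excluded twice causes no difficulty), and the smallest available digit is found by shifting left until a 1 overflows the 8-bit word. The function names are assumptions.

```python
WIDTH = 8                                   # digits 0..7, leftmost bit = digit 0


def mask_of(digits):
    """Word with a 1 in the position of each listed digit."""
    m = 0
    for d in digits:
        m |= 1 << (WIDTH - 1 - d)
    return m


def allowed_mask(used, catalogue_exclusions):
    """1s mark permissible digits; exclusions are merged with OR."""
    excluded = mask_of(used) | mask_of(catalogue_exclusions)
    return ~excluded & ((1 << WIDTH) - 1)


def pop_smallest(mask):
    """Shift left, testing for overflow; return (digit, remaining mask)."""
    word = mask
    for d in range(WIDTH):
        word <<= 1
        if word >> WIDTH:                   # a 1 was shifted out: digit d is free
            return d, mask & ~(1 << (WIDTH - 1 - d))
    return None, 0


# Exclusions of the illustration: digits in the line, the digit above, the catalogue.
mask = allowed_mask(used=[1, 6, 4, 7], catalogue_exclusions=[4, 3, 5])
print(format(mask, "08b"))
digit, remaining = pop_smallest(mask)
print(digit, format(remaining, "08b"))
```

The remaining mask is what would be kept for a later backtrack.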

A flow chart of the program is given in [4, p. 191]. 

A block design with k = 3, λ = 1 is called a Steiner triple system.
The systems with v = 15 were sought by systematic search. The full set
of parameters is b = 35, v = 15, r = 7, k = 3, λ = 1.

One such system is the following:

    1  2  3
    1  4  5    2  4  6    3  4 10
    1  6  7    2  5  8    3  5  7
    1  8  9    2  7  9    3  6 11    4  7 12    5  6 14    6  8 12    7  8 11
    1 10 11    2 10 12    3  8 15    4  8 13    5  9 10    6  9 15    7 10 15
    1 12 13    2 11 14    3  9 13    4  9 14    5 11 13    6 10 13    7 13 14
    1 14 15    2 13 15    3 12 14    4 11 15    5 12 15    8 10 14    9 11 12.


Suppose that, in constructing such a system, we have taken all triples
involving 1, 2, and 3, say as above. We then construct a 15 x 15 table
showing which objects have appeared with which. Thus 4 has appeared
with 1, 2, 3, 5, 6, 10. Hence the triples to be with 4 will be of the form

        4 X_1 X_2
        4 X_3 X_4
        4 X_5 X_6
        4 X_7 X_8,

where $X_1,\ldots,X_8$ are 7, 8, 9, 11, 12, 13, 14, 15 in some order. The
order of these four triples is immaterial, and so we may take $X_1 = 7$
without loss of generality. For $X_2$ we may take any one of 8, 9, 11, 12,
13, 14, 15 which has not already appeared with 7. We begin by taking
the smallest, $X_2 = 8$. Next $X_3$ is taken as the smallest value not used,
then, in turn, $X_4$ and $X_5$. We continue until we reach a conflict; then
we backtrack, using the next higher value. We give the initial steps:

    4 X X      4 7 X      4 7 8      4 7 8      4 7 8
    4 X X      4 X X      4 X X      4 9 X      4 9 11
    4 X X      4 X X      4 X X      4 X X      4 X X
    4 X X      4 X X      4 X X      4 X X      4 X X

    4 7 8      4 7 8                   4 7 8
    4 9 11     4 9 11                  4 9 11
    4 12 X     4 12 13 (conflict)      4 12 14 (conflict)
    4 X X      4 X X                   4 X X

    4 7 8      4 7 8      4 7 8
    4 9 11     4 9 11     4 9 11
    4 12 15    4 12 15    4 12 15
    4 X X      4 13 X     4 13 14.

This gives a set of permissible triples with 4, which is then stored, and
the process continues with 5 and these triples. Backtracking to the 4s
later on, we backtrack to

        4 7 8
        4 9 11
        4 12 X
        4 X X

but find that all values have been used with 12. We now backtrack
further to

        4 7 8
        4 9 X
        4 X X
        4 X X

and proceed to

        4 7 8
        4 9 12
        4 X X
        4 X X
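The forward step and the backtrack can be condensed into a short recursive search. The sketch below is a simplified variant rather than the program described: instead of working object by object with a 15 x 15 table, it always takes the smallest pair of objects not yet covered and tries third objects in increasing order, undoing the last choice on a conflict. It returns a single system and makes no attempt to enumerate all of them; the function names are our own.

```python
from itertools import combinations


def is_steiner(triples, v):
    """True if every pair of the objects 1..v lies in exactly one triple."""
    pairs = [frozenset(p) for t in triples for p in combinations(t, 2)]
    return len(pairs) == len(set(pairs)) == v * (v - 1) // 2


def steiner_triple_system(v):
    """Backtracking search for one Steiner triple system on objects 1..v."""
    used, triples = set(), []

    def first_uncovered_pair():
        for a in range(1, v + 1):
            for b in range(a + 1, v + 1):
                if frozenset((a, b)) not in used:
                    return a, b
        return None

    def extend():
        pair = first_uncovered_pair()
        if pair is None:
            return True                                   # every pair is covered
        a, b = pair
        for c in range(b + 1, v + 1):                     # smallest value first
            new = [frozenset(p) for p in combinations((a, b, c), 2)]
            if all(p not in used for p in new):
                used.update(new)
                triples.append((a, b, c))
                if extend():
                    return True
                triples.pop()                             # conflict later on: backtrack
                used.difference_update(new)
        return False

    return list(triples) if extend() else None


sts7 = steiner_triple_system(7)
sts9 = steiner_triple_system(9)
print(sts7)
print(sts9)
```

Restricting the third object to values above b loses nothing, since any smaller object already has all its pairs with a or b covered.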


An attempt was made to restrict the initial starts as much as possible, so
as to avoid duplication, without, of course, missing any possible system.
The triples with 1 may trivially be taken as above. Then with 2 there
are only four essentially different possibilities:

        A             B             C             D
    2  4  6       2  4  6       2  4  6       2  4  6
    2  5  7       2  5  7       2  5  8       2  5  8
    2  8 10       2  8 15       2  7  9       2  7 10
    2  9 11       2  9 10       2 10 12       2  9 12
    2 12 14       2 11 12       2 11 14       2 11 14
    2 13 15       2 13 14       2 13 15       2 13 15

For A there are three sets of four triples of the form (1,a,b), (1,c,d), (2,a,c),
(2,b,d). This A type is called a triple tetrad (the term is due to Cole [7]).
B has one such set and is called a single tetrad. C has two sets of the form
(1,a,b), (1,c,d), (1,e,f), (2,a,c), (2,b,e), (2,d,f) and is called a hexad. D
has no combination of interlocking triples short of the full subsystem
[excluding (1,2,3)] and is called a duodecad. Thus, attempts to limit
duplication were made primarily in terms of triples involving 3.
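The four types can be told apart mechanically. One way, which is our own reformulation and not taken from the text, is to join two of the remaining twelve objects whenever they occur together in a triple with 1 or with 2; every object then has exactly two neighbors, and the graph falls into cycles whose lengths are 4, 4, 4 for a triple tetrad, 4, 8 for a single tetrad, 6, 6 for a hexad, and 12 for a duodecad. The sketch below computes these lengths for a pair of objects i, j.

```python
def cycle_type(triples, i, j):
    """Sorted cycle lengths of the graph on the other objects whose edges are the
    pairs occurring in a triple with exactly one of the objects i, j."""
    adj = {}
    for t in triples:
        s = set(t)
        if (i in s) != (j in s):                 # skip the triple through both i and j
            a, b = sorted(s - {i, j})
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
    seen, lengths = set(), []
    for start in adj:
        if start in seen:
            continue
        length, prev, cur = 0, None, start
        while cur not in seen:                   # walk once around the cycle
            seen.add(cur)
            length += 1
            nxt = [x for x in adj[cur] if x != prev]
            prev, cur = cur, nxt[0]
        lengths.append(length)
    return tuple(sorted(lengths))


NAMES = {(4, 4, 4): "A (triple tetrad)", (4, 8): "B (single tetrad)",
         (6, 6): "C (hexad)", (12,): "D (duodecad)"}

with_1 = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (1, 8, 9), (1, 10, 11), (1, 12, 13), (1, 14, 15)]
options = {
    "A": [(2, 4, 6), (2, 5, 7), (2, 8, 10), (2, 9, 11), (2, 12, 14), (2, 13, 15)],
    "B": [(2, 4, 6), (2, 5, 7), (2, 8, 15), (2, 9, 10), (2, 11, 12), (2, 13, 14)],
    "C": [(2, 4, 6), (2, 5, 8), (2, 7, 9), (2, 10, 12), (2, 11, 14), (2, 13, 15)],
    "D": [(2, 4, 6), (2, 5, 8), (2, 7, 10), (2, 9, 12), (2, 11, 14), (2, 13, 15)],
}
for label, with_2 in options.items():
    print(label, NAMES[cycle_type(with_1 + with_2, 1, 2)])
```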

Despite attempts to eliminate duplication, several thousand complete
solutions were found. The problem then was to determine which were
isomorphic. For this J. D. Swift used two ingenious programs. The
set of all triples involving two letters i and j will have one of the four
patterns A, B, C, D listed above, and the type of the pattern will be unchanged
by substitution. Thus the set of all patterns A, B, C, D is an
invariant of the system. There will be (15 · 14)/2 = 105 such patterns
A, B, C, D, and a necessary condition for the isomorphism of two systems
is that they both have the same number of A's, B's, C's, and D's. (The
system given as an example has only C's and D's.) A first program calculated
the patterns A, B, C, D, and it was found that there were in all
80 different sets of such patterns. A second program took a single system
from one of the 80 sets and tried to set up an isomorphism with the
remaining systems in that set. This succeeded in every case, and so it
was shown that there are exactly 80 Steiner triple systems of order 15.
These had been listed by hand previously by White, Cole, and Cummings
[7] and by Fisher [1]. All 80 systems were obtained by Cole and his
co-workers, but it was not clear that their methods were exhaustive.
Indeed, their long monograph uses several methods, and most of the
methods used are certainly not exhaustive. Fisher used an exhaustive
method but in fact obtained only 79 of the 80 systems, missing the one
listed here.

Let us turn to the search for difference sets (see [2]) carried out on
SWAC. We seek k residues modulo v,

$$a_1, a_2, \ldots, a_k \pmod v$$

such that every residue $d \not\equiv 0$ (mod v) has exactly λ representations in
the form

$$a_i - a_j \equiv d \pmod v.$$

Here we have the relation

$$k(k - 1) = \lambda(v - 1).$$

We take k < v/2 (as we may, since the complement of a difference set is
also a difference set). The range 3 ≤ k ≤ 50 was studied, and this involved
268 choices of parameters. Of these choices 101 correspond to
no design, because of the criteria of Chowla and Ryser, and thus a
fortiori to no difference set. Of the remaining 167, difference sets were
found in 46 cases, and in only 12 cases does the existence of a difference
set remain undecided.

A variety of hand procedures made it possible to find some difference
sets and show in other cases that none existed. But in a number of cases
searches were carried out well beyond the scope of hand calculation.
Every case treated on the machine involved a situation in which there
was a multiplier and in which some block was fixed by the multiplier.
We illustrate the general method with an example.

        v = 121,   k = 40,   λ = 13.

Here 3 divides n = k - λ = 27 and can be shown to be a multiplier
fixing a block. Hence the residues of the difference set occur in sets left
unchanged by multiplication by 3. A first program on SWAC calculates
these sets. In this case we have the following sets:

    A     0
    B     11, 33, 44, 55, 99
    C     22, 66, 77, 88, 110
    R1    1, 3, 9, 27, 81
    R2    4, 12, 36, 108, 82
    R3    5, 15, 45, 14, 42
    R4    16, 48, 23, 69, 86
    R5    20, 60, 59, 56, 47
    R6    25, 75, 104, 70, 89
    R7    26, 78, 113, 97, 49
    R8    31, 93, 37, 111, 91
    R9    34, 102, 64, 71, 92
    R10   38, 114, 100, 58, 53
    R11   67, 80, 119, 115, 103
    N1    2, 6, 18, 54, 41
    N2    7, 21, 63, 68, 83
    N3    8, 24, 72, 95, 43
    N4    10, 30, 90, 28, 84
    N5    13, 39, 117, 109, 85
    N6    17, 51, 32, 96, 46
    N7    19, 57, 50, 29, 87
    N8    35, 105, 73, 98, 52
    N9    40, 120, 118, 112, 94
    N10   61, 62, 65, 74, 101
    N11   76, 107, 79, 116, 106
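The sets left unchanged by multiplication by 3 are the orbits of the map $x \to 3x$ (mod 121). A minimal sketch of the first step, with an illustrative function name, is the following; since 3 has order 5 modulo 121, it yields the orbit of 0 together with 24 orbits of five residues each.

```python
def orbits(v, t):
    """Orbits of the residues mod v under repeated multiplication by t (t prime to v)."""
    seen, result = set(), []
    for x in range(v):
        if x in seen:
            continue
        orbit, y = [], x
        while y not in seen:
            seen.add(y)
            orbit.append(y)
            y = (t * y) % v
        result.append(orbit)
    return result


orbs = orbits(121, 3)
for orbit in orbs:
    print(orbit)
```

Each orbit is printed in the order in which it is generated, so the set B appears as 11, 33, 99, 55, 44.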


In the second SWAC program, an initial set is taken, differences are
tallied, then a further set is taken, and new differences are computed and
tallied. If any difference occurs more than λ times, the last set added is
discarded, together with the differences it contributed, and then a further
set is tried. After several hours' running on the machine, three solutions
were found, of which two were isomorphic, but the run was stopped since
it appeared that it would take perhaps a hundred hours to make the
complete run. At this stage it was decided to examine the situation by
hand. Considering the difference set modulo 11, let us suppose that $a_0$
residues are congruent to 0 (mod 11). The residues 1, 3, 4, 5, 9 (quadratic
residues) will occur equally often, say x times, and the residues 2, 6,
7, 8, 10 (quadratic nonresidues) will occur equally often, say y times.
Then we have two relations:


    $a_0 + 5x + 5y = 40,$
    $a_0(x + y) + 2x^2 + 5xy + 2y^2 = 143.$


The first of these says merely that k = 40. The second counts differences
congruent to 1 (mod 11), and each of 11 such residues modulo 121 must
occur 13 times. The solutions of these equations are


    $a_0 = 0,\ x = 5,\ y = 3;$    $a_0 = 0,\ x = 3,\ y = 5;$
    $a_0 = 5,\ x = 5,\ y = 2;$    $a_0 = 5,\ x = 2,\ y = 5.$
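
The two relations can be solved by direct search. The sketch below assumes the relations in the form $a_0 + 5x + 5y = k$ and $a_0(x + y) + 2x^2 + 5xy + 2y^2 = 11\lambda$, with each unknown between 0 and 11; the function name and the search bounds are illustrative assumptions.

```python
def residue_count_solutions(k=40, lam=13):
    """Search for (a0, x, y) satisfying the two counting relations mod 11."""
    solutions = []
    for a0 in range(12):        # at most 11 residues mod 121 are multiples of 11
        for x in range(12):     # x counts sets taken from R1..R11
            for y in range(12): # y counts sets taken from N1..N11
                if a0 + 5 * x + 5 * y != k:
                    continue
                if a0 * (x + y) + 2 * x * x + 5 * x * y + 2 * y * y != 11 * lam:
                    continue
                solutions.append((a0, x, y))
    return solutions


print(residue_count_solutions())
```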


Hence, multiplying the difference set by a suitable value, we may
assume x = 5; that is, the difference set includes exactly five of the sets
R1, ..., R11. Moreover, these sets, under multiplication by quadratic
residues modulo 121, are permuted in a cycle of length 11. This gives
us, to within isomorphism,

    $\frac{1}{11} \cdot \frac{11 \cdot 10 \cdot 9 \cdot 8 \cdot 7}{1 \cdot 2 \cdot 3 \cdot 4 \cdot 5} = 42$ ways

of choosing these sets. With 42 separate starts, each start went through
very quickly on the machine, and a total of four nonisomorphic solutions
were found. These are as follows:

v = 121, k = 40, λ = 13

1. 1, 3, 4, 7, 9, 11, 12, 13, 21, 25, 27, 33, 34, 36, 39, 44, 55, 63, 64, 67,
68, 70, 71, 75, 80, 81, 82, 83, 85, 89, 92, 99, 102, 103, 104, 108, 109,
115, 117, 119.

2. 1, 3, 4, 5, 9, 12, 13, 14, 15, 16, 17, 22, 23, 27, 32, 34, 36, 39, 42, 45,
46, 48, 51, 64, 66, 69, 71, 77, 81, 82, 85, 86, 88, 92, 96, 102, 108, 109,
110, 117.

3. 1, 3, 4, 7, 8, 9, 12, 21, 24, 25, 26, 27, 34, 36, 40, 43, 49, 63, 64, 68, 70,
71, 72, 75, 78, 81, 82, 83, 89, 92, 94, 95, 97, 102, 104, 108, 112, 113,
118, 120.

4. 1, 3, 4, 5, 7, 9, 12, 14, 15, 17, 21, 27, 32, 36, 38, 42, 45, 46, 51, 53, 58,
63, 67, 68, 76, 79, 80, 81, 82, 83, 96, 100, 103, 106, 107, 108, 114,
115, 116, 119.


Of these solutions, the first represents the three spaces in a four-dimen- 
sional projective space over the field with three elements. The other 
three do not do so (as is easily seen by checking the intersections of three 
sets) and so represent different designs. It is not known whether or not 
2, 3, and 4 represent the same or different designs. 
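
A candidate set of this kind can be checked by tallying differences, as in the second SWAC program. The sketch below is illustrative only: the function names are assumptions, and it is exercised here on two small classical difference sets rather than on the four sets listed above.

```python
from collections import Counter


def difference_counts(v, D):
    """Tally every difference a - b (mod v) of distinct elements of D."""
    counts = Counter()
    for a in D:
        for b in D:
            if a != b:
                counts[(a - b) % v] += 1
    return counts


def is_difference_set(v, D, lam):
    """True if each nonzero residue mod v occurs exactly lam times as a difference."""
    counts = difference_counts(v, D)
    return all(counts[d] == lam for d in range(1, v))


def is_fixed_by_multiplier(v, D, t):
    """True if multiplication by t (mod v) maps D onto itself."""
    return {(t * d) % v for d in D} == set(D)


print(is_difference_set(7, [1, 2, 4], 1), is_fixed_by_multiplier(7, [1, 2, 4], 2))
```

The same call, `is_difference_set(121, D, 13)`, could be applied to each of the four 40-element sets given above.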


REFERENCES 


1. R. A. Fisher, An Examination of the Different Possible Solutions in a Problem
of Incomplete Blocks, Ann. Eugenics, vol. 10, pp. 52-75, 1940. 

2. M. Hall, Jr., A Survey of Difference Sets, Proc. Amer. Math. Soc., vol. 7, pp. 975- 
986, 1956. 

3. M. Hall, Jr., and J. D. Swift, Determination of Steiner Triple Systems of Order 
15, Math. Tables Aids Comput., vol. 9, pp. 146-156, 1955. 

4. M. Hall, Jr., J. D. Swift, and R. J. Walker, Uniqueness of the Projective Plane 
of Order Eight, Math. Tables Aids Comput., vol. 10, pp. 186-194, 1956. 

5. H. W. Norton, The 7 × 7 Squares, Ann. Eugenics, vol. 9, pp. 269-307, 1939.

6. A. Sade, An Omission in Norton’s List of 7 x 7 Squares, Ann. Math. Statist., 
vol. 22, pp. 306-307, 1951. 

7. A. S. White, F. N. Cole, and L. D. Cummings, Complete Classification of Triad
Systems on Fifteen Elements, Mem. Nat. Acad. Sci., vol. 14, 2d mem., 1925.


15.3 Suggested Numerical Analysis of Some Discrete 
Problems 


Certain problems in group theory might profitably be attacked by
machine methods. We consider first the problem of investigating a
group G generated by elements subject to certain relations. Since the
word problem for groups is unsolvable, it cannot be expected that all
group problems of this type can be treated, but most problems which
arise naturally can be expected to be within reason. The main difficulty
encountered is the sheer volume of the calculations involved, but this is
the sort of work in which computing machines have made their greatest
contribution. If such calculations were to be carried out, several
questions would remain to be settled: mainly, how much of the intermediate
work should be recorded and in what form the results should be given.
But these issues are common in all numerical analysis.

Given a group G and a subgroup H, a procedure is given here (1) for 
finding generators for H in terms of generators for G and representatives 
of left cosets of H in G and (2) for expressing relations on the generators 
of G as relations on the generators of H. For example, if it is believed 
that G has a specific finite order, then we may choose H as a subgroup 
which will reduce to the identity if the hypothesis is correct. Or it may 
be to our advantage to choose some subgroup H which is believed to be 
of a relatively simple kind. 

It is known [4] that a group G generated by a finite number of elements
$a_1, \ldots, a_r$ in which $z^4 = 1$ for every $z \in G$ is finite and that there is a
largest group B(4,r) such that every other group with this property is a
homomorphic image of it. Trivially, B(4,1) is of order 4, and it is
known that B(4,2) is of order $2^{12}$, but the order of B(4,r) is not known in
general. Calculation of, say, B(4,3) should be of considerable value in
determining this. The procedures are illustrated with B(4,3) in mind.

Let H be a subgroup of a group G and let the following be the
decomposition of G in terms of left cosets of H:

    $G = H \cdot 1 + Hx_2 + \cdots + Hx_n.$    (15.21)

For $g \in G$, let us define

    $\varphi(g) = x_i$ if $g \in Hx_i.$    (15.22)

The notation $\bar{g} = x_i$ is sometimes used. We note that, if $g \in Hx_i$, then

    $g = hx_i$ where $h \in H,$    (15.23)

and consequently

    $g\,\varphi(g)^{-1} = h \in H.$    (15.24)

Theorem. If G is generated by $a_1, a_2, \ldots, a_r$ and if (15.21) gives the
decomposition of G into n cosets of H, then H is generated by
$u_{ij} = x_i a_j \varphi(x_i a_j)^{-1}$, i = 1, ..., n, j = 1, ..., r.

Corollary. If G is generated by r elements and H is of index n in G, then
H is generated by at most rn elements.

Proof. Consider an arbitrary element $h_1$ of H. Then $h_1$ may be
expressed in terms of the generators $a_1, \ldots, a_r$ of G in the form

    $h_1 = b_1 b_2 \cdots b_t,$    (15.25)

where each $b_i = a_j^{\epsilon}$, $\epsilon = \pm 1$, for some j. We certainly have

    $\varphi(1) = 1, \quad \varphi(h_1) = 1,$    (15.26)

since by hypothesis $h_1 \in H$, and the identity has by convention been
chosen as the representative of H. Then the following is an identical
relation:


    $h_1 = \varphi(1)\, b_1\, \varphi(b_1)^{-1} \cdot \varphi(b_1)\, b_2\, \varphi(b_1 b_2)^{-1} \cdots \varphi(b_1 \cdots b_{i-1})\, b_i\, \varphi(b_1 \cdots b_i)^{-1}$
    $\qquad \cdots\ \varphi(b_1 \cdots b_{t-1})\, b_t\, \varphi(b_1 \cdots b_t)^{-1},$    (15.27)


since between $b_i$ and $b_{i+1}$ we have inserted $\varphi(b_1 \cdots b_i)^{-1} \varphi(b_1 \cdots b_i) = 1$,
and $\varphi(1) = 1$, $\varphi(b_1 \cdots b_t)^{-1} = \varphi(h_1)^{-1} = 1$. Thus we have expressed
$h_1$ as a product of factors of the form


    $u = \varphi(b_1 \cdots b_{i-1})\, b_i\, \varphi(b_1 \cdots b_i)^{-1}.$    (15.28)


But if $b_i = a_j$ and $\varphi(b_1 \cdots b_{i-1}) = x_k$, then u in (15.28) is of the form
$x_k a_j \varphi(x_k a_j)^{-1}$, whereas if $b_i = a_j^{-1}$ and $\varphi(b_1 \cdots b_i) = x_k$, then u in (15.28)
is of the form $[x_k a_j \varphi(x_k a_j)^{-1}]^{-1}$. Thus $h_1$ is expressed in terms of elements
$x_k a_j \varphi(x_k a_j)^{-1}$, and, by (15.24), we note that these are all in H.

It is natural to construct coset representatives for H from previously
constructed representatives by adding on a generator $a_i$ or its inverse
$a_i^{-1}$. This kind of construction is always possible. More precisely, a
set of elements S in a group is called a Schreier system if, whenever $x \in S$,
$x = b_1 \cdots b_t$, each $b_i$ = some $a_j^{\epsilon}$, $\epsilon = \pm 1$, then also $b_1 \cdots b_{t-1} \in S$.
Coset representatives may always be taken as a Schreier system, and
indeed if elements are ordered by length and by an alphabetical order
for the same length, then, choosing the earliest element as representative
for each coset, the coset representatives automatically form a Schreier
system. With a Schreier system of representatives, there are various
further results which hold for the generators of H. For example, if G is
a free group, then the generators $u_{ij} \neq 1$ of H are free generators (see
[1, 2, 3]).

Suppose that G is generated by elements a, b, c, and let H be the normal
subgroup of index 8 such that G/H is the elementary abelian group of
order 8. Let us take 1, a, b, c, ab, ac, bc, abc as coset representatives of H.
We find the following 17 generators u for H:


    x_i     x_i a    φ(x_i a)    u = x_i a φ(x_i a)^{-1}

    1       a        a           1
    a       a^2      1           a^2
    b       ba       ab          b a b^{-1} a^{-1}
    c       ca       ac          c a c^{-1} a^{-1}
    ab      aba      b           a b a b^{-1}                       (15.29a)
    ac      aca      c           a c a c^{-1}
    bc      bca      abc         b c a c^{-1} b^{-1} a^{-1}
    abc     abca     bc          a b c a c^{-1} b^{-1}


    x_i     x_i b    φ(x_i b)    u = x_i b φ(x_i b)^{-1}

    1       b        b           1
    a       ab       ab          1
    b       b^2      1           b^2
    c       cb       bc          c b c^{-1} b^{-1}
    ab      ab^2     a           a b^2 a^{-1}                       (15.29b)
    ac      acb      abc         a c b c^{-1} b^{-1} a^{-1}
    bc      bcb      c           b c b c^{-1}
    abc     abcb     ac          a b c b c^{-1} a^{-1}


    x_i     x_i c    φ(x_i c)    u = x_i c φ(x_i c)^{-1}

    1       c        c           1
    a       ac       ac          1
    b       bc       bc          1
    c       c^2      1           c^2
    ab      abc      abc         1                                  (15.29c)
    ac      ac^2     a           a c^2 a^{-1}
    bc      bc^2     b           b c^2 b^{-1}
    abc     abc^2    ab          a b c^2 b^{-1} a^{-1}

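We list the 17 generators and also their inverses by length and
alphabetically for the same length (the two columns are alphabetized
separately):


    Generators                              Inverses

    α_1  = a^2                              α_1^{-1}  = a^{-2}
    α_2  = b^2                              α_2^{-1}  = b^{-2}
    α_3  = c^2                              α_3^{-1}  = c^{-2}
    α_4  = a b a b^{-1}                     α_8^{-1}  = a b a^{-1} b^{-1}
    α_5  = a b^2 a^{-1}                     α_5^{-1}  = a b^{-2} a^{-1}
    α_6  = a c a c^{-1}                     α_11^{-1} = a c a^{-1} c^{-1}
    α_7  = a c^2 a^{-1}                     α_7^{-1}  = a c^{-2} a^{-1}
    α_8  = b a b^{-1} a^{-1}                α_4^{-1}  = b a^{-1} b^{-1} a^{-1}
    α_9  = b c b c^{-1}                     α_12^{-1} = b c b^{-1} c^{-1}
    α_10 = b c^2 b^{-1}                     α_10^{-1} = b c^{-2} b^{-1}
    α_11 = c a c^{-1} a^{-1}                α_6^{-1}  = c a^{-1} c^{-1} a^{-1}
    α_12 = c b c^{-1} b^{-1}                α_9^{-1}  = c b^{-1} c^{-1} b^{-1}
    α_13 = a b c a c^{-1} b^{-1}            α_17^{-1} = a b c a^{-1} c^{-1} b^{-1}
    α_14 = a b c b c^{-1} a^{-1}            α_16^{-1} = a b c b^{-1} c^{-1} a^{-1}
    α_15 = a b c^2 b^{-1} a^{-1}            α_15^{-1} = a b c^{-2} b^{-1} a^{-1}
    α_16 = a c b c^{-1} b^{-1} a^{-1}       α_14^{-1} = a c b^{-1} c^{-1} b^{-1} a^{-1}
    α_17 = b c a c^{-1} b^{-1} a^{-1}       α_13^{-1} = b c a^{-1} c^{-1} b^{-1} a^{-1}

                                                                    (15.30)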


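The square of every element of G lies in H. In particular, let us express
$(ab^{-1}c)^2 = ab^{-1}cab^{-1}c$ in terms of the α's. We proceed as follows:


    $ab^{-1}cab^{-1}c = (ab^{-1})(cab^{-1}c)$
    $\qquad = (ab^{-2}a^{-1} \cdot ab)(cab^{-1}c)$
    $\qquad = \alpha_5^{-1}(abca)(b^{-1}c)$
    $\qquad = \alpha_5^{-1}(abcac^{-1}b^{-1} \cdot bc)(b^{-1}c)$
    $\qquad = \alpha_5^{-1}\alpha_{13}(bcb^{-1})\,c$
    $\qquad = \alpha_5^{-1}\alpha_{13}(bcb^{-1}c^{-1} \cdot c)\,c$
    $\qquad = \alpha_5^{-1}\alpha_{13}\alpha_{12}^{-1}\alpha_3.$    (15.31)


Hence $(ab^{-1}c)^2 = \alpha_5^{-1}\alpha_{13}\alpha_{12}^{-1}\alpha_3$, and if in G we have the relation

    $(ab^{-1}c)^4 = 1,$    (15.32a)

this becomes in H the relation

    $(\alpha_5^{-1}\alpha_{13}\alpha_{12}^{-1}\alpha_3)^2 = 1.$    (15.32b)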


The procedure for converting used in (15.31) may be described in a
form suitable for computers. The generators of H, the $\alpha_i$, are first
computed, and the $\alpha_i$ and their inverses are alphabetized and stored. We
take an initial segment of $ab^{-1}cab^{-1}c$ as long as possible, which is a coset
representative (here only the letter a), and then also the succeeding
letter, taking $ab^{-1}$. Since the added letter is an inverse, we look under
the inverses of the α's for a (necessarily unique) word beginning this way.
This is $\alpha_5^{-1} = ab^{-2}a^{-1}$. We follow this by correcting terms so that the
value is unchanged; thus $ab^{-1} = ab^{-2}a^{-1} \cdot ab$. We then keep the α and
proceed in the same way with the remaining letters; here $abcab^{-1}c = (abca)(b^{-1}c)$,
since abc is a coset representative but abca is not. The
same procedure is followed here, and with another step the expression
$\alpha_5^{-1}\alpha_{13}\alpha_{12}^{-1}\alpha_3$ is obtained.
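
The rewriting procedure lends itself to a short program. The following is a minimal sketch, assuming the special case treated above (G free on a, b, c, and H the subgroup whose cosets are determined by the parity of the number of occurrences of each letter); all function names are hypothetical. A word is a list of pairs (letter, ±1), and each factor is printed as the word $x\,a\,\varphi(xa)^{-1}$ together with the exponent ±1 with which it occurs.

```python
LETTERS = "abc"  # generators of G; cosets of H are given by letter parities


def coset_rep(word):
    """Coset representative of H*word, as a string such as 'ab' ('' means 1)."""
    parity = dict.fromkeys(LETTERS, 0)
    for letter, _exp in word:
        parity[letter] ^= 1
    return "".join(g for g in LETTERS if parity[g])


def as_word(rep):
    return [(g, 1) for g in rep]


def inverse(word):
    return [(g, -e) for g, e in reversed(word)]


def free_reduce(word):
    """Cancel adjacent inverse pairs."""
    out = []
    for g, e in word:
        if out and out[-1] == (g, -e):
            out.pop()
        else:
            out.append((g, e))
    return out


def schreier_generator(rep, g):
    """The word x g phi(x g)^{-1} for coset representative x = rep."""
    w = as_word(rep) + [(g, 1)]
    return free_reduce(w + inverse(as_word(coset_rep(w))))


def show(word):
    return "".join(g if e == 1 else g + "^-1" for g, e in word) or "1"


def rewrite(word):
    """Express an element of H as a product of Schreier generators (15.28)."""
    factors, prefix = [], []
    for g, e in word:
        before = coset_rep(prefix)
        prefix = prefix + [(g, e)]
        # For a letter a_j use the coset before it; for a_j^{-1} the coset after it.
        rep = before if e == 1 else coset_rep(prefix)
        u = schreier_generator(rep, g)
        if u:  # skip generators that reduce to the identity
            factors.append((show(u), e))
    return factors


REPS = ["", "a", "b", "c", "ab", "ac", "bc", "abc"]
nontrivial = [show(schreier_generator(x, g))
              for x in REPS for g in LETTERS if schreier_generator(x, g)]

word = [("a", 1), ("b", -1), ("c", 1), ("a", 1), ("b", -1), ("c", 1)]
print(len(nontrivial))
print(rewrite(word))
```

In this output `abba^-1` is $ab^2a^{-1}$ and `cc` is $c^2$, so the four factors are the inverse of $ab^2a^{-1}$, then $abcac^{-1}b^{-1}$, then the inverse of $cbc^{-1}b^{-1}$, then $c^2$.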

This procedure, when coset representatives are as short as possible and
form a Schreier system, will always give a shorter expression for an
element of H in terms of the generators of H than its expression in terms of
generators of G. Thus $ab^{-1}cab^{-1}c$ is of length 6, and $\alpha_5^{-1}\alpha_{13}\alpha_{12}^{-1}\alpha_3$ is of
length 4. If we are fortunate in our choice, we find an explicit form for
the elements of H. A favorable case is, of course, that in which H turns
out to be abelian.

Next we consider two ways in which numerical analysis can be applied 
to the study of block designs. 

Extensive use of the criterion of Connor mentioned in Sec. 15.1 has 
not as yet been made. Consider the problem of constructing a design
with parameters b = 69, v = 46, r = 9, k = 6, λ = 1. It is not known
at present whether or not such a design exists. As mentioned in Sec.
15.1, if we have t initial trial blocks $B_1, \ldots, B_t$ and if $S_{ju}$, j, u = 1, ..., t,
is the number of objects common to $B_j$ and $B_u$, then putting


    $c_{jj} = (r - k)(r - \lambda), \quad c_{ju} = \lambda k - r S_{ju} \ (j \neq u),$    (15.33)


a necessary condition in order that it be possible to complete $B_1, \ldots, B_t$
to a full design is that

    $\det C_t = \det |c_{ju}| \geq 0.$    (15.34)


Hence, if we find that certain choices make $\det |c_{ju}| < 0$, we may
exclude from consideration all sets of initial blocks of this kind. If this
excludes a sufficient number of combinations, this may provide a method 
of building up the design by using only permissible combinations or, if 
everything is excluded, of showing that no such design exists. 

For the particular design mentioned, since λ = 1, then $S_{ju}$, $j \neq u$, can
have only the value 0 or 1. Here $c_{jj} = 24$, $c_{ju} = 6$ if $S_{ju} = 0$, and
$c_{ju} = -3$ if $S_{ju} = 1$. We may divide out the common factor 3 of these
numbers for purposes of testing (15.34). We thus consider determinants
of symmetric matrices, diagonal elements being 8, off-diagonal elements
being 2 for parallel blocks, −1 for intersecting blocks. Thus we find,
with all off-diagonal elements −1 in a determinant of order 10,


    $\begin{vmatrix} 8 & -1 & \cdots & -1 \\ -1 & 8 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & 8 \end{vmatrix} = -9^9,$    (15.35)


and so we cannot have 10 blocks each intersecting the other 9. Indeed, 
we can show that if 9 blocks intersect each other, a further block must 
intersect 6 of these and be parallel to the remaining 3. 
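
A test of this kind is easily mechanized with exact rational arithmetic. The sketch below is illustrative, not Connor's own formulation: the helper names are assumptions, the matrix is built from a table of intersection numbers $S_{ju}$ according to (15.33), and the determinant is evaluated by Gaussian elimination over fractions.

```python
from fractions import Fraction


def exact_det(matrix):
    """Determinant by Gaussian elimination over exact fractions."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def connor_matrix(S, r, k, lam):
    """Matrix (c_ju) of (15.33) from the intersection numbers S[j][u]."""
    t = len(S)
    return [[(r - k) * (r - lam) if j == u else lam * k - r * S[j][u]
             for u in range(t)] for j in range(t)]


def nine_plus_one(m):
    """Nine mutually intersecting blocks and a tenth meeting the first m of them."""
    S = [[0 if j == u else 1 for u in range(10)] for j in range(10)]
    for j in range(m, 9):
        S[j][9] = S[9][j] = 0
    return S


r, k, lam = 9, 6, 1
all_ten = [[0 if j == u else 1 for u in range(10)] for j in range(10)]
det10 = exact_det(connor_matrix(all_ten, r, k, lam))
print(det10 < 0)
print([m for m in range(10) if exact_det(connor_matrix(nine_plus_one(m), r, k, lam)) >= 0])
```

Here the unscaled entries 24, 6, −3 are used, so `det10` is $3^{10}$ times the value in (15.35); only the sign matters for the test (15.34).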

If a large number of these determinants is evaluated, it is to be hoped 
that the information gained will be such as to indicate how to go about 
the construction of this design. Naturally it is to be wished that study 
of a particular design will suggest methods and theorems of general 
application. 

A further attack on block designs is based on the theory of convex 
spaces. 

In Sec. 15.1 it was noted that, if we take


    $L_j = \sum_{i=1}^{v} a_{ij} x_i, \quad j = 1, \ldots, b,$    (15.36)

where $a_{ij} = 1$ if the ith object is in the jth block and $a_{ij} = 0$ if not, then

    $L_1^2 + \cdots + L_b^2 = Q = r(x_1^2 + \cdots + x_v^2) + 2\lambda \sum_{i<j} x_i x_j.$    (15.37)

Here, if we make a trial for the first t blocks, we are assuming explicit
values for $L_1, \ldots, L_t$, and we must have


    $\bar{Q} = Q - L_1^2 - \cdots - L_t^2 = L_{t+1}^2 + \cdots + L_b^2.$    (15.38)


Connor's method depends on asserting that, if $L_{t+1}, \ldots, L_b$ exist in
(15.38), then $\bar{Q}$ must be a positive semidefinite form. But even stronger
statements may be made about $\bar{Q}$. We use the fact that $L_{t+1}, \ldots, L_b$,
if they exist, will be linear forms with nonnegative coefficients. Hence $\bar{Q}$
must belong to the class T of quadratic forms which can be written as the
sum of squares of nonnegative linear forms. Trivially, the class T
consists of semidefinite forms with nonnegative coefficients. But the
class T is even more restricted. For consider the form


    $Q = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 + x_1x_4 + x_1x_5 + \tfrac{3}{2}x_2x_3 + x_2x_5 + x_3x_4$
    $\quad = (x_4 + \tfrac{1}{2}x_1 + \tfrac{1}{2}x_3)^2 + (x_5 + \tfrac{1}{2}x_1 + \tfrac{1}{2}x_2)^2$
    $\qquad + \tfrac{1}{2}(x_1 - \tfrac{1}{2}x_2 - \tfrac{1}{2}x_3)^2 + \tfrac{5}{8}(x_2 + x_3)^2.$    (15.39)

From (15.39) it is clear that Q is positive semidefinite and has nonnegative
coefficients. But Q does not belong to the class T of forms which are
sums of squares of nonnegative linear forms. We shall assume that Q
has such a representation and reach a contradiction. We assume


    $Q = \sum_{i=1}^{N} L_i^2,$ all coefficients of the $L_i$ being $\geq 0,$    (15.40)

and we let the numbering in (15.40) be such that $L_1, \ldots, L_s$ are the
linear forms

    $L = \cdots + u_2x_2 + u_3x_3 + \cdots,$

in which both $x_2$ and $x_3$ have positive coefficients. In such an L the
coefficients of $x_1$, $x_4$, and $x_5$ must be zero, since otherwise, in (15.40), we
would get one of the terms $x_1x_2$, $x_2x_4$, $x_3x_5$ with a positive coefficient
contrary to the explicit form of Q in (15.39). Hence


    $L_1^2 + \cdots + L_s^2 = Ax_2^2 + \tfrac{3}{2}x_2x_3 + Bx_3^2.$    (15.41)

Thus (15.40) becomes


    $Q = Q(x_1, x_2, x_3, x_4, x_5)$
    $\quad = Ax_2^2 + \tfrac{3}{2}x_2x_3 + Bx_3^2 + Q_1(x_1, x_2, x_3, x_4, x_5),$    (15.42)


    where $Q_1 = L_{s+1}^2 + \cdots + L_N^2.$    (15.43)

If we now put $x_1 = \tfrac{1}{2}x_2 + \tfrac{1}{2}x_3$, $x_4 = -\tfrac{1}{2}x_1 - \tfrac{1}{2}x_3$, $x_5 = -\tfrac{1}{2}x_1 - \tfrac{1}{2}x_2$
in (15.40), then (15.41) is unaffected, and, by (15.39), Q reduces to
$\tfrac{5}{8}(x_2 + x_3)^2$. Here (15.42) takes the form


    $\tfrac{5}{8}(x_2 + x_3)^2 = Ax_2^2 + \tfrac{3}{2}x_2x_3 + Bx_3^2 + Q_1(x_2, x_3),$    (15.44)

and $Q_1$ is positive semidefinite. Thus $0 \leq A \leq \tfrac{5}{8}$, $0 \leq B \leq \tfrac{5}{8}$, and so

    $\tfrac{9}{4} - 4AB \geq \tfrac{36}{16} - \tfrac{25}{16} = \tfrac{11}{16}.$    (15.45)


Thus the form on the right-hand side of (15.41) has a positive
discriminant and is indefinite, conflicting with its expression as a sum of squares.
We have been led to a conflict by the assumption that Q could be written
as a sum of squares of nonnegative forms.

This example shows that the class T of quadratic forms which can be
expressed as the sum of squares of nonnegative linear forms is more
restricted than the intersection of the class D of semidefinite forms and the
class P of forms with nonnegative coefficients. How are we to recognize
and make use of this restriction? The theory of convex spaces gives us
some information on this. With a quadratic form $Q(x_1, \ldots, x_n)$,


    $Q = \sum_{i,j=1}^{n} b_{ij} x_i x_j \quad (b_{ij} = b_{ji}),$    (15.46)

we associate a point $B = (b_{11}, \ldots, b_{nn})$ in $n^2$-dimensional space,
restricting ourselves to the linear subspace for which $b_{ij} = b_{ji}$. Here the class
T is a convex cone, and the extreme points of T are merely the squares
of nonnegative linear forms $(a_1x_1 + \cdots + a_nx_n)^2$, $a_i \geq 0$. The adjoint
space T* consists of all points $C = (c_{11}, \ldots, c_{nn})$ with $c_{ij} = c_{ji}$ such that

    $\sum_{i,j} c_{ij} b_{ij} \geq 0$    (15.47)


for every $B \in T$. Since we know the extreme points of T, this means
that

    $\sum_{i,j} c_{ij} a_i a_j \geq 0$ for all $a_i \geq 0.$    (15.48)


In other words, (15.48) says that T* is the space of quadratic forms
nonnegative for nonnegative arguments. A. Horn has shown that the
spaces D of semidefinite forms and P of nonnegative forms are each their
own adjoint. From the general theory of convex spaces, since $T \subset D \cap P$,
it follows that $T^* \supset D \cup P$. Thus T* contains all semidefinite
forms and all nonnegative forms, but indeed T* contains still further
forms. The symmetric matrix

    $K = \begin{pmatrix} 1 & 1 & 1 & -1 & -1 \\ 1 & 1 & -1 & 1 & -1 \\ 1 & -1 & 1 & -1 & 1 \\ -1 & 1 & -1 & 1 & 1 \\ -1 & -1 & 1 & 1 & 1 \end{pmatrix}$    (15.49)

corresponds to the form

    $K = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 + 2x_1x_2 + 2x_1x_3 - 2x_1x_4 - 2x_1x_5$
    $\qquad - 2x_2x_3 + 2x_2x_4 - 2x_2x_5 - 2x_3x_4 + 2x_3x_5 + 2x_4x_5.$    (15.50)

We may express K in two ways:

    $K = (x_1 + x_2 - x_3 - x_4 - x_5)^2 + 4x_1x_3 + 4(x_2 - x_3)x_4,$    (15.51)
    $K = (x_1 - x_2 + x_3 - x_4 - x_5)^2 + 4x_1x_2 + 4(x_3 - x_2)x_5.$    (15.52)


For nonnegative x's we see from (15.51) that if $x_2 \geq x_3$, then $K \geq 0$,
whereas, from (15.52), if $x_3 \geq x_2$, then $K \geq 0$. Hence in every case K
is nonnegative for nonnegative arguments. Thus K is in T*, and indeed
Horn has shown that K is an extreme point of T*. The form Q of (15.39)
corresponds to the matrix

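    $Q = \begin{pmatrix} 1 & 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 1 & \tfrac{3}{4} & 0 & \tfrac{1}{2} \\ 0 & \tfrac{3}{4} & 1 & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} & 1 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 & 1 \end{pmatrix}.$    (15.53)


If we calculate the inner product of the 25-dimensional vectors
corresponding to K and Q, we find


    $(K,Q) = 5 - 2(\tfrac{3}{4} + \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2}) = -\tfrac{1}{2}.$    (15.54)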
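
The inner product (15.54), the sum-of-squares expression in (15.39), and the nonnegativity of K on nonnegative arguments can all be checked mechanically. The following sketch uses exact fractions; the helper names are assumptions, and the check of K is only a spot check on a small integer grid, not a proof.

```python
from fractions import Fraction as Fr
from itertools import product

h, q = Fr(1, 2), Fr(3, 4)
K = [[1, 1, 1, -1, -1],
     [1, 1, -1, 1, -1],
     [1, -1, 1, -1, 1],
     [-1, 1, -1, 1, 1],
     [-1, -1, 1, 1, 1]]
Q = [[1, 0, 0, h, h],
     [0, 1, q, 0, h],
     [0, q, 1, h, 0],
     [h, 0, h, 1, 0],
     [h, h, 0, 0, 1]]


def inner(A, B):
    """Inner product of two 5x5 matrices regarded as 25-dimensional vectors."""
    return sum(A[i][j] * B[i][j] for i in range(5) for j in range(5))


def value(M, x):
    """Value of the quadratic form with matrix M at the point x."""
    return sum(M[i][j] * x[i] * x[j] for i in range(5) for j in range(5))


# Weighted squares of (15.39): (weight, coefficients of x1..x5).
SQUARES = [(Fr(1), (h, 0, h, 1, 0)),
           (Fr(1), (h, h, 0, 0, 1)),
           (h, (1, -h, -h, 0, 0)),
           (Fr(5, 8), (0, 1, 1, 0, 0))]


def gram(squares):
    """Matrix of the quadratic form sum of w * (l . x)^2."""
    return [[sum(w * l[i] * l[j] for w, l in squares) for j in range(5)]
            for i in range(5)]


k_min = min(value(K, x) for x in product(range(4), repeat=5))
print(inner(K, Q), gram(SQUARES) == Q, k_min)
```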


Hence, since K is a point of T*, Q is not a point of T. This yields a new
proof that Q is not a sum of squares of nonnegative linear forms. Thus
a calculation of points $C = (c_{ij})$ of T* gives the simple linear test (15.47)
for determining whether a form is in T. Presumably extreme points of
T* such as K give the most effective tests. However, once points in T*
have been calculated, they may be tabulated in some permanent form
and will thereafter be available to test large numbers of forms such as Q
arising in (15.38). The test (15.47) is very simple to apply, once the
points in T* have been found.


SOME ILLUSTRATIVE COMPUTATIONS IN
ALGEBRAIC NUMBER THEORY by Harvey Cohn


The object of the first part of this chapter, Secs. 16.1 to 16.3, is to 
focus attention on a special phase of integral numerical analysis [1], 
namely, algebraic-number-theory computations, in which a high degree 
of purity is preserved in machine work through two features: first, the 
machine serves as a “scientific instrument” rather than as a ‘“‘tally 
sheet,” participating in some of the intricacies of the theory [2]; second, 
the machine can use modular arithmetic [3] to treat irrationals exactly, 
without round-off. The subject matter is a little too specialized to be 
treated in detail or depth; we therefore restrict ourselves to illustrative 
samples. 


16.1 Rational Primes 


The best-known algebraic-number-theory computation is a seemingly 
integral one, namely, the testing of certain primes [4]. 

We might first digress briefly to note that the discovery of new primes
is plagued by the disadvantage of diminishing returns, for the testing
of a prime P by itself would involve in theory approximately $P^{1/2}$ trial
divisions by potential divisors. The object of good machine practice
is therefore not a tour de force of electronic reliability but the discovery
of new infinite classes of prospective primes that can be tested more
elegantly (in fewer steps).

Most of the very elegant tricks apply to relatively isolated cases.
For instance, consider the rather crucial fact that 641 divides $2^{32} + 1$,
first seen by Euler. If one happens to notice that


    $641 = 640 + 1 = 5 \cdot 2^7 + 1$
    $\phantom{641} = 625 + 16 = 5^4 + 2^4,$

then it is easily seen that


    $5 \cdot 2^7 + 1 \equiv 5^4 + 2^4 \equiv 0 \pmod{641};$


and thus, eliminating the symbol 5 between two congruences, we obtain
$2^{32} + 1 \equiv 0$. Actually, the number 641 was not too hard to locate,
since it can be shown that the number $F = 2^{2^t} + 1$ has as its prime
divisors only numbers $\equiv 1 \pmod{2^{t+2}}$. Yet this is of little help in
general, since the number of trials is now reduced only slightly, to the
order $F^{1/2}/\log F$. Such clever devices for the so-called Fermat primes
F do not seem to generalize to machine programs very well [5]. A
machine-type (but less interesting) criterion of primality is that
$3^{(F-1)/2} \equiv -1 \pmod{F}$. A variant of this test which is both
complicated enough to require algebraic number theory and simple enough
to be transparent will now be applied to a different set of primes.
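
Both the divisibility by 641 and the criterion $3^{(F-1)/2} \equiv -1 \pmod{F}$ are easy to try on a modern machine. The sketch below is illustrative; the function names are assumptions, and the criterion is applied only for $t \geq 1$.

```python
def fermat_number(t):
    """F = 2^(2^t) + 1."""
    return 2 ** (2 ** t) + 1


def passes_power_of_three_test(t):
    """Check 3^((F-1)/2) = -1 (mod F) for F = 2^(2^t) + 1, t >= 1."""
    F = fermat_number(t)
    return pow(3, (F - 1) // 2, F) == F - 1


print(fermat_number(5) % 641)  # residue of 2^32 + 1 modulo 641
print(641 % 2 ** 7)            # 641 modulo 2^(t+2) with t = 5
print([t for t in range(1, 9) if passes_power_of_three_test(t)])
```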
The Mersenne primes are those primes $P = 2^p - 1$ where p is an odd
prime. The Lucas-Lehmer test [6] for primality is as follows: define


    $u_1 = 4, \quad u_{t+1} = u_t^2 - 2;$    (16.1)

then P is prime if and only if

    $u_{p-1} \equiv 0 \pmod{P}.$    (16.1a)
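
A minimal sketch of the test just stated, assuming the indexing $u_1 = 4$, $u_{t+1} = u_t^2 - 2$ and reducing modulo P at every step; the function name is an assumption.

```python
def lucas_lehmer(p):
    """Return True if u_{p-1} = 0 (mod 2^p - 1), for an odd prime p."""
    P = 2 ** p - 1
    u = 4                      # u_1
    for _ in range(p - 2):     # p - 2 squarings lead from u_1 to u_{p-1}
        u = (u * u - 2) % P
    return u == 0


print([p for p in (3, 5, 7, 11, 13, 17, 19, 23) if lucas_lehmer(p)])
```

Only p − 2 squarings modulo P are required, which is the source of the economy noted later in this section.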
Now this criterion becomes algebraic rather than integral if we define

    $\omega_t = 2\cos(2\pi/2^t),$
    $\omega_2 = 2\cos(\pi/2) = 0$    (16.2)


and note that as before

    $\omega_{t-1} = \omega_t^2 - 2.$    (16.2a)


Then $\omega_p$ corresponds to $u_1$ and $\omega_2$ to $u_{p-1}$. Thus the Lucas-Lehmer
test states that P is prime if and only if, in some way, $\omega_p$ "represents"
4 (mod P). We then look into this type of congruence more closely.
Specifically, we define an algebraic integer [7] as a number ω satisfying
an irreducible monic equation in integral coefficients $a_i$, namely,


    $f(\omega) = \omega^n - a_1\omega^{n-1} + a_2\omega^{n-2} - \cdots + (-1)^n a_n = 0.$    (16.3)

Here the conjugate roots have as their product $a_n$, the norm, more
generally denoted as $N(\omega) = a_n$. Easily, $N(\omega_1\omega_2) = N(\omega_1)N(\omega_2)$.
We can then say that an algebraic integer ω is represented by an ordinary
integer u (mod q) when


    $N(\omega - u) \equiv 0 \pmod{q}.$    (16.4)


Now it is not true that for every q there exists a u that will make
(16.4) valid. Actually, in terms of the defining polynomial for ω,
(16.4) states that


    $f(u) \equiv 0 \pmod{q}$    (16.4a)


for any such u, so that there are at most n of them when q is prime.
The proof goes precisely as in the case of "numerical" solutions of
equations. Also, multiple roots can occur only when the "discriminant
is 0" in the numerical case or only when q divides the discriminant, or
the rational integer

    $D = \prod [\omega^{(i)} - \omega^{(j)}]^2,$    (16.5)


taken over distinct (unordered) pairs of conjugates of ω. Furthermore,
in most of the cases under consideration the equation is normal; that is,
every root is equal to some polynomial in rational coefficients of any one
root. Thus for a normal algebraic integer, if the prime q does not
divide the discriminant, then (16.4a) has no roots or exactly n distinct
roots.

Now let us apply this development to Mersenne primes. The
induction of $\omega_t$ in (16.2) implies that $\omega_2 = 0$, $\omega_3 = 2^{1/2}$, $\omega_4 = (2 + 2^{1/2})^{1/2}$,
until finally, for example,


    $\omega_p = \{2 + [2 + \cdots + (2 + 2^{1/2})^{1/2} \cdots]^{1/2}\}^{1/2}$  ...  (p − 2) radicals    (16.6)

or conversely $\omega_p$ satisfies this equation of degree $2^{p-2}$:

    $f_{p-2}(\omega) = \{[(\omega^2 - 2)^2 - 2]^2 - \cdots - 2\}^2 - 2 = 0$  ...  (p − 2) squares.    (16.7)


Thus we can easily see the sufficiency of the Lucas-Lehmer test for
primality, once we notice that (16.7) is normal. This is seen by writing
the $2^{p-2}$ roots as $\omega^{(r)} = 2\cos(2\pi r/2^p)$, where r is odd, $1 \leq r < 2^{p-1}$.
Then, if s is another odd value and if $r \equiv ks \pmod{2^p}$, we easily can
express $\omega^{(r)} = g[\omega^{(s)}]$, where $\cos k\theta = g(\cos\theta)$ by a well-known
trigonometric identity. Furthermore, it can be seen that D is an exact
power of 2; in fact, writing $\omega^{(r)} = \exp(2\pi i r/2^p) + \exp(-2\pi i r/2^p)$, we
can verify (see [8])

ie aa (16.52 ) 


Suppose now the condition (16.1a) holds; then we show that P has
no proper prime divisor q. For, if so, $f_{p-2}(u) \equiv 0 \pmod{q}$ would have
its full quota of $2^{p-2}$ distinct roots (by virtue of the presence of the one
root u = 4). Then $2^{p-2}$ [$= (P + 1)/4$] $\leq q$, and thus P must be a
prime, since it could have no prime divisor $q \leq P^{1/2}$ as long as $P^{1/2} < (P + 1)/4$
(true when $P \geq 31$).

The sufficiency of the Lucas-Lehmer test is therefore established.
Unfortunately the necessity goes too deeply into algebraic number theory
to be appropriate here. We merely note in conclusion that this process
has the order of log P steps, a considerable improvement [9] over
$P^{1/2}$, and that, incidentally, the advantages of a binary machine become
manifest when P is represented in digital notation.


16.2 Units 


The elementary (rational) operations require the introduction, with
ω, of the field R(ω), or the set of all quantities that are expressible as
rational functions of ω with rational coefficients. The elements of
R(ω) that are also algebraic integers are called integers of the field R(ω).
They are not necessarily all expressible as a polynomial in ω with
integers for coefficients, but for convenience we consider only fields R(ω)
generated by an ω with this property. It can be verified, for instance,
that for the quadratic field connected with $m^{1/2}$ we would have to take
$\omega = m^{1/2}$ when $m \not\equiv 1 \pmod{4}$ and $\omega = (1 + m^{1/2})/2$ when $m \equiv 1 \pmod{4}$
(here m is square-free). Thus the equation in integers (for
m = 7), $x^2 - 7y^2 = \pm A$, is the same as $N(\omega) = \pm A$ where
$\omega = x + y \cdot 7^{1/2}$, an unknown field integer in $R(7^{1/2})$.

From the last illustration we can see the importance of a special class
of algebraic integers known as units η with $N(\eta) = \pm 1$, since two
solutions of $N(\xi) = \pm A$ might be trivially equivalent ($\eta\xi_1 = \xi_2$).

The problem of finding units in an arbitrary field cannot be dismissed
as purely mechanical, but the procedure is generally to search for a
$\xi_1$, $\xi_2$ with $N(\xi_1) = N(\xi_2) = A$ and ask hopefully whether $\eta = \xi_1/\xi_2$ is an
integer. Rationalizing the denominator by writing
$\eta = [\xi_1 \xi_2' \cdots \xi_2^{(n-1)}]/[\xi_2 \xi_2' \cdots \xi_2^{(n-1)}]$, we see that, since the value of the denominator is
$N(\xi_2) = A$, the problem is to show that the numerator is a polynomial
in ω whose coefficients are all divisible [10] by A. In other words, the
numerator can be calculated modulo A to see whether the coefficients
are congruent to zero. In principle this is the same as asking whether
$\xi_1/\xi_2$ really represents a zero division modulo p for any prime divisor p
of A or whether cancellation (owing to units) forestalls this zero division.

A related problem consists of deciding whether or not one unit is a
power of another (unknown) unit; that is, for a given η, can we write


    $\eta = \pm\eta_0^{\,r}$    (16.8)


for an unknown $\eta_0$ and r? In practice the number of different values
of r is limited by other conditions beyond this discussion; so we specialize
to fixed r. For instance, consider the trial

    $\eta_1 = 7 + 5 \cdot 2^{1/2} = (a + b \cdot 2^{1/2})^3 = \eta_0^3.$    (16.8a)


We first find a prime p (see [11]) for which $2^{1/2}$ represents an odd integer
and for which the cube-root extraction is unique. We try p = 23,
since $N(5 - 2^{1/2}) = 23$; so now $2^{1/2}$ represents 5 (mod 23). Then, if
$\eta_0$ represents $u_0$ and $\eta_1$ represents $u_1$, we find that (16.8a) becomes


    $7 + 5 \cdot 5 \equiv u_0^3 \pmod{23},$
    $7 - 5 \cdot 5 \equiv (u_0')^3,$    (16.8b)

and by primitive roots, for example, $u_0 = +6$, $u_0' = -4$.
Hence, solving in terms of the representation of $\eta_0$ as $a + b \cdot 2^{1/2}$, we
find


    $a + b \cdot 5 \equiv 6 \ (= u_0) \pmod{23},$
    $a - b \cdot 5 \equiv -4 \ (= u_0'),$    (16.8c)

and so $a \equiv 1$, $b \equiv 1$ (mod 23). If we try other acceptable moduli, like
$47 = N(7 - 2^{1/2})$, we can use the Chinese remainder theorem to find
that, in no time, a and b are determined modulo some enormous integer.
Here we could either verify the natural guess


    $7 + 5 \cdot 2^{1/2} = (1 + 1 \cdot 2^{1/2})^3,$    (16.8d)


made on the basis of the residues of a and b, or use some (difficult)
estimates on the a priori size of a and b to determine the actual values
[12].
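
The congruence calculation above can be sketched as follows. The function names are assumptions made for illustration; `s` denotes the ordinary integer representing $2^{1/2}$ modulo p (5 for p = 23, 7 for p = 47), and cube roots are found by direct search, which suffices for small moduli.

```python
def cube_roots(n, p):
    """All u in 0..p-1 with u^3 = n (mod p)."""
    return [u for u in range(p) if (u ** 3 - n) % p == 0]


def recover_a_b(p, s, c0=7, c1=5):
    """Residues of a, b mod p with (a + b*sqrt2)^3 = c0 + c1*sqrt2, sqrt2 -> +-s."""
    assert (s * s - 2) % p == 0
    (u0,) = cube_roots(c0 + c1 * s, p)       # unique when p = 2 (mod 3)
    (u0_conj,) = cube_roots(c0 - c1 * s, p)
    inv2 = pow(2, -1, p)
    a = (u0 + u0_conj) * inv2 % p
    b = (u0 - u0_conj) * inv2 * pow(s, -1, p) % p
    return a, b


def cube(a, b):
    """(a + b*sqrt2)^3 as a pair (rational part, coefficient of sqrt2)."""
    return a ** 3 + 6 * a * b * b, 3 * a * a * b + 2 * b ** 3


print(recover_a_b(23, 5), recover_a_b(47, 7), cube(1, 1))
```

The modular inverse `pow(x, -1, p)` requires Python 3.8 or later.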


16.3 Unique Factorization


Now the significance of algebraic number theory lies in the fact that 
the integers of a field do not generally exhibit unique factorization. 
This became clear in the famous case in which Fermat’s last theorem 
[13] was “solved”? by Lamé, Cauchy, and others on the basis of such a 
false assumption. Specifically, if we consider indecomposables to mean 
algebraic integers of the field with no further factorizations (not using 
units), then we cannot assume that two factorizations of a number into 
indecomposables must match factor for factor (ignoring multiplications 
by units). Thus the primes of an algebraic number field are not the 
indecomposables but are specially contrived, so-called ideal numbers
(which are too involved to discuss further here).

Thus, discussing this failure of unique factorization rather than the
remedy, we might consider the field generated by $\omega = (-5)^{1/2}$. Here
clearly

    $N[-4 + (-5)^{1/2}] = 21 \equiv 0 \pmod{7},$    (16.9)

or 4 represents ω (mod 7). Note that (16.9) gives a factorization
of 21, irreconcilable with 3 · 7, although all four factors 3, 7,
$-4 \pm (-5)^{1/2}$ are indecomposable (and no units exist to provide cross
identifications).
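
The indecomposability of the four factors can be checked with norms: the factors have norms 9, 49, 21, 21, so a proper factorization of any of them would require an element $x + y(-5)^{1/2}$ of norm 3 or 7. A minimal sketch, assuming that the integers of the field are the numbers $x + y(-5)^{1/2}$ with ordinary integers x, y; the function names are hypothetical.

```python
from math import isqrt


def norm(x, y):
    """Norm of x + y*sqrt(-5)."""
    return x * x + 5 * y * y


def elements_of_norm(n):
    """All (x, y) with x^2 + 5 y^2 = n."""
    bound = isqrt(n)
    return [(x, y)
            for x in range(-bound, bound + 1)
            for y in range(-bound, bound + 1)
            if norm(x, y) == n]


print(norm(-4, 1), elements_of_norm(3), elements_of_norm(7), elements_of_norm(1))
```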

Now it is true, more generally, that unique factorization in a normal
field will fail if q is a modulus for which an algebraic integer is
representable by an ordinary integer whereas no algebraic integer $\omega_0$ exists
for which

    $q = N(\omega_0).$    (16.10)


Here the electronic computer can be put to work contriving long series
of fields and primes q for which (16.10) is manifestly impossible and
hence for which unique factorization must fail [14].

In conclusion we might mention a more famous unsolved problem
which should be amenable to further computation, namely, the proof
or disproof of unique factorization in fields of the type $R(\omega_t)$,
$\omega_t = 2\cos(2\pi/2^t)$. Now Reuschle's tabulation [15] of complex "primes" shows,
among other things, that the unique factorization occurs when t = 3,
4 and is not contradicted by the inadequate evidence for t = 5, 6, 7.
It turns out, as part of the theory, that it is only necessary to test those
q which are congruent to ±1 (mod $2^t$) and lie below the quantity
$D^{1/2}$ [for D in (16.5a)]. The problem is also one of central interest since
$R(\omega_3), R(\omega_4), R(\omega_5), \ldots$ present an especially important example of a
set of fields each of which is included in the next by (16.2a). Yet no
progress has been made since 1875, although a general revival of
computational interest seems clear from the current literature [16].


16.4 Notes 


1. For a bibliography of immediately relevant material, see O. Taussky, Some
Computational Problems in Algebraic Number Theory, in American Mathematical
Society, "Numerical Analysis: Proceedings of Symposia in Applied Mathematics,
Volume VI," J. H. Curtiss, ed., pp. 103-108, McGraw-Hill Book Company, Inc.,
New York, 1956 (see Secs. 16.5 to 16.10). For a more varied survey of
computational work, see J. Todd, Motivations for Working in Numerical Analysis, in
"Transactions of Symposium on Computing, Mechanics, Statistics, and Partial
Differential Equations," pp. 97-116, Interscience Publishers, Inc., New York, 1955
(see Chap. 1).

2. Lest the reader underestimate the intricacies of tallying, he should consult
E. Lehmer, Number Theory on the SWAC, in American Mathematical Society,
"Numerical Analysis: Proceedings of Symposia in Applied Mathematics, Volume
VI," J. H. Curtiss, ed., pp. 187-193, McGraw-Hill Book Company, Inc., New York,
1956.

3. The discussion in this chapter presupposes elementary congruence properties.
See, for example, G. H. Hardy and E. M. Wright, "An Introduction to the Theory of
Numbers," Oxford University Press, New York, 1954.

4. A recent picture of the "large-prime" competition as well as a fairly complete
set of references can be found in R. M. Robinson, Mersenne and Fermat Numbers,
Proc. Amer. Math. Soc., vol. 5, pp. 842-846, 1954.

5. An effort to generalize this device (of Western) can be found in J. C. Morehead, 
Extension of the Sieve of Eratosthenes to Arithmetical Progressions and Applications, 
Ann. Math., vol. 10, pp. 88-104, 1909. 

6. See D. H. Lehmer, An Extended Theory of Lucas’ Function, Ann. Math., vol. 31, 
pp. 419-448, 1930. 

7. The high degree of abstraction achieved by the algebraic theory of numbers 
produced a remoteness from the examination of integers. For an earlier reference 
work, see H. Weber, “Lehrbuch der Algebra,’”’ II, Vieweg-Verlag, Brunswick, 
Germany, 1899. 

8. See page 756 of the book by Weber referred to in [7]. 

9. A factor of log P must be applied to machine time, owing to the size of P in 
digits (and registers) needed for the calculations modulo P. 

10. It is actually preferable to make A a prime p or power of a prime whenever 
convenient. See H. Cohn and S. Gorn, A Computation of Cyclic Cubic Units, 
J. Res. Nat. Bur. Standards, vol. 59, pp. 155-168, 1957. 

11. In this particular case the reader can check that $p \equiv -1 \pmod 3$ is the
condition that only one cube root in, say, each congruence (16.85) exist.

12. A corresponding calculation of cyclic cubic units was made by H. P. F. 
Swinnerton-Dyer using an electronic computer, but the report of this study exists 
only in manuscript. 

13. See G. E. Wahlin and H. S. Vandiver, "Algebraic Numbers," II, National
Research Council Bulletin 62, 1928. The work of the latter author on Fermat's
last theorem presages more modern computational attitudes.

14. See H. Cohn, A Device for Generating Fields of Even Class Numbers, Proc. 
Amer. Math. Soc., vol. 7, pp. 595-598, 1956. 

15. See C. G. Reuschle, ‘“Tafeln der Complexen Primzahlen welche aus Wurzeln 
der Einheit gebildet sind,” Berlin, 1875. 

16. For many illustrations of modern advanced computational techniques, see 
H. Hasse, Arithmetische Bestimmung von Grundeinheit und Klassenzahl in zyklischen
kubischen und biquadratischen Zahlkörpern, Abh. Deutsch. Akad. Wiss. Berlin,
Math.-Nat. Kl., 1950.


SOME COMPUTATIONAL PROBLEMS IN 
ALGEBRAIC NUMBER THEORY* by Olga Taussky 


16.5 Introduction 


It is frequently claimed that many facts in ordinary number theory
can be fully understood only through their generalization to algebraic
number fields. A typical fact is the exceptional role played by the
prime number 2 in many cases. However, in number fields one proves
with ease that all numbers $1 - \zeta$ play an exceptional role when $\zeta$ is a
root of unity. Another example is the quadratic law of reciprocity,
for which a really illuminating proof is found only by using number
fields. Also, the Fermat problem is frequently attacked via number
fields.


* This is an extended version of the article on pp. 187-193 of American Mathematical
Society, "Numerical Analysis: Proceedings of Symposia in Applied Mathematics,
Volume VI," J. H. Curtiss, ed., McGraw-Hill Book Company, Inc., New
York, 1956.

However, the study of number theory in these fields provides its own
difficulties and has still to deal with many open problems. Progress
is particularly hindered by the greatly increased difficulties of numerical
examples in these fields, as compared to the rational field.

In this brief report concerning computational problems in algebraic
number theory, only problems concerning the most fundamental concepts
are mentioned. A list of table work concerning algebraic number
fields (there is not much of it) can be found in D. H. Lehmer [1].
Many other problems have come up (see, e.g., [2]).


16.6 Integral Bases 


It is known that for fields of degree > 3 an integral base cannot 
always be found which consists of the powers of a single algebraic integer 
only. Although the existence of an integral base for any field is easily 
established, its construction presents difficulties (see, e.g., [3]). 


16.7 Factorization of Rational Primes in Number Fields 


An ordinary prime number $p$ will, in general, not remain a prime
number in a given algebraic number field $F$ but will split up into a
product of powers of prime ideals:


$$p = \mathfrak{p}_1^{e_1} \cdots \mathfrak{p}_r^{e_r}.$$


Apart from a finite number of primes $p$, namely, the divisors of the
discriminant of $F$, we have $e_i = 1$.

The question is, What are the possible values of $r$ and of the $e_i$?
Further, since norm $\mathfrak{p}_i = p^{f_i}$, what are the $f_i$? The laws which govern
these numbers are not fully known in all fields. However, a great
number of important facts are known about them, and their structure is
completely clarified in cyclotomic fields and their subfields. Since the
extensions of class-field theory to general algebraic extensions have not
yet been able to clear up the decomposition laws of rational primes in
arbitrary fields, special numerical work in this connection is very desirable.
Kuroda [4] has computed some results concerning nonabelian
fields of degree $2^n$.

Like many other computations in algebraic number theory, the splitting
of rational primes can be treated by rational methods only. This
fact is very important if computation by automatic computing machinery
is considered. Only the knowledge of the irreducible polynomial $f(x)$,
a zero of which generates the field in question, is needed; for the following
facts hold for all but a finite number of primes [5]. Let


$$f(x) \equiv P_1^{e_1} \cdots P_r^{e_r} \pmod{p},$$


where $P_i$ is an irreducible polynomial modulo $p$ and $P_i \not\equiv P_k \pmod{p}$,
$i \neq k$. Then $p$ splits up in the form


$$p = \mathfrak{p}_1^{e_1} \cdots \mathfrak{p}_r^{e_r},$$

where $\mathfrak{p}_i \neq \mathfrak{p}_k$. If the degree of $P_i$ is $f_i$, then norm $\mathfrak{p}_i = p^{f_i}$.
Ore [6, 7] extended the method just described to include all prime
numbers by considering congruences modulo a power of $p$ whose exponent is sufficiently
large.
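
The rational procedure just described can be tried on a small case. The sketch below is only an illustration, not the method of [5]: it assumes a monic cubic and replaces full factorization modulo $p$ by root counting, which is enough for a cubic whose discriminant is not divisible by $p$ (no root means $f$ stays irreducible, three roots mean three linear factors). The function names are hypothetical, and the sample polynomial $x^3 - 91x + 273$ is the cubic examined later in Sec. 16.9.

```python
def roots_mod_p(coeffs, p):
    """Return the residues x in 0..p-1 with f(x) = 0 (mod p).

    coeffs lists the integer coefficients of f from the highest power down.
    """
    roots = []
    for x in range(p):
        value = 0
        for c in coeffs:              # Horner's rule, reduced mod p at each step
            value = (value * x + c) % p
        if value == 0:
            roots.append(x)
    return roots


def cubic_splitting_type(coeffs, disc, p):
    """Classify the decomposition of p for a monic cubic with discriminant disc.

    Primes dividing disc belong to the excluded finite set and are not decided.
    """
    if disc % p == 0:
        return "excluded"
    count = len(roots_mod_p(coeffs, p))
    # A cubic without repeated factors mod p has 0, 1, or 3 roots.
    return {0: "inert", 1: "linear*quadratic", 3: "splits completely"}[count]


f = [1, 0, -91, 273]                       # x^3 - 91x + 273
disc = -4 * (-91) ** 3 - 27 * 273 ** 2     # discriminant of x^3 + px + q
for p in (2, 3, 5, 11, 17, 19):
    print(p, cubic_splitting_type(f, disc, p))
```

Root counting cannot decide $p = 11$ here, because 11 divides the discriminant of this particular polynomial; that prime has to be handled by another generator of the field or by Ore's extension.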


16.8 Units 


Other important problems arise in connection with the units in fields. 
To find the units is not always easy. The main problem is to find a 
set of base units. 

In complex quadratic fields there are no units apart from roots of
unity. In real quadratic fields there is one base unit $\varepsilon$ and all other
units are of the form $\pm\varepsilon^n$, $n = 0, \pm 1, \pm 2, \ldots$. If $d$ is the discriminant
of the field, then the unit $\varepsilon$ is of the form $(x + y\sqrt{d})/2$, where $x$, $y$ are
the smallest positive solutions of $(x^2 - dy^2)/4 = \pm 1$. There is a
rational routine method for finding $\varepsilon$ by means of continued fractions.

Let $p > 2$ be a prime number. The base unit of the field generated
by $\sqrt{p}$ can be put into the form $(t + u\sqrt{p})/2$. Recently Ankeny, Artin,
and Chowla [10] inquired whether $u \not\equiv 0$ $(p)$. They verified this for
$p \equiv 5$ $(8)$ and $p < 2000$. This conjecture was later verified by K.
Goldberg on SEAC up to $p < 100{,}000$.
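
The two preceding paragraphs can be illustrated numerically. The following sketch does not use continued fractions; it assumes that a plain search over $y = 1, 2, \ldots$ is acceptable for small discriminants, and the names `base_unit` and `is_prime` are hypothetical. It then repeats the Ankeny-Artin-Chowla check for the primes $p \equiv 5$ $(8)$ below 200, for which the discriminant of the field is $p$ itself.

```python
from math import isqrt


def base_unit(d):
    """Smallest positive (x, y, sign) with x^2 - d*y^2 = 4*sign, sign = +1 or -1.

    d is assumed to be the discriminant of a real quadratic field
    (positive and not a perfect square), so that a solution exists.
    """
    y = 1
    while True:
        # For a fixed y, try the smaller candidate x^2 = d*y^2 - 4 first.
        for sign in (-1, 1):
            x2 = d * y * y + 4 * sign
            if x2 > 0:
                x = isqrt(x2)
                if x * x == x2:
                    return x, y, sign
        y += 1


def is_prime(n):
    return n > 1 and all(n % q for q in range(2, isqrt(n) + 1))


for d in (5, 8, 12, 13, 17):
    print(d, base_unit(d))

# u is the second component of the base unit (t + u*sqrt(p))/2.
aac_ok = all(base_unit(p)[1] % p != 0
             for p in range(5, 200, 8) if is_prime(p))
print(aac_ok)
```

For larger discriminants the search becomes slow, since the base unit can be very large; that is where the continued-fraction routine mentioned in the text is needed.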

A routine method for finding a unit in cyclic cubic fields which, 
together with its conjugates, generates all the units was given by Hasse [8]. 

Units in noncyclic cubic fields were treated by several authors (see 
[9], where more references can be found; see also [1]). 


16.9 Ideal Classes and Class Numbers 


Tables for the class numbers of real quadratic fields have been made 
by Ince [11] and for the cyclic cubic fields by Hasse [8]. Hasse has a 
routine method for finding the class numbers in cyclic cubic fields, but 
it is rather complicated. 

If no routine method is aimed at, the work is sometimes simpler. 
A bound for the class number and a method for computing it are given 
by the following known theorem: 

In each class there is an ideal whose norm does not exceed $\sqrt{|d|}$, where
$d$ is the discriminant of the field.

A sharper bound is $(4/\pi)^{r_2}(n!/n^n)\sqrt{|d|}$ (see [12]). It is further known
that

$$h\kappa = \lim_{s \to 1} (s - 1)\zeta(s),$$


where $h$ is the class number, and


$$\kappa = \frac{2^{r_1}(2\pi)^{r_2} R}{w\sqrt{|d|}}.$$

Here $R$ is the regulator of the field, $d$ the discriminant, $w$ the number of
roots of unity, $r_1$ the number of real conjugate fields, $2r_2$ the number of
complex ones, and


$$\zeta(s) = \sum_{\mathfrak{a}} (\operatorname{norm} \mathfrak{a})^{-s},$$

where $\mathfrak{a}$ runs through all ideals in the field. (This sum converges for
all $s > 1$.)

Further, there are many facts whose knowledge can cut down the
work considerably in special cases. Quite a number of facts are known
about the class number in cyclotomic fields and their subfields. These
fields have been investigated more closely, partly because they are
more accessible and partly because of their importance to the Fermat
problem. Many results concerning class numbers in these fields go
back to Kummer and to H. Weber. Later Furtwängler [13, 14]
generalized some of their results; for example, he proved that the class
number of the field generated by the $l^n$th root of unity is divisible by $l$
if and only if the class number of the field generated by the $l$th root of
unity is. Further, let $f$, $F$ be two subfields of the field of the $l^n$th root of
unity and $f \subset F$. He then proved that the class number of $f$ divides
that of $F$. More recently a book by Hasse [15] appeared which is concerned
with the class number in these fields and their largest real subfields.
It contains many new theorems and tables.

Scholz [16], Inaba [17], Taussky [18], and others studied the subfields
of prime degree $l$ of cyclotomic fields. The subfield of degree $l$ of
the field generated by the $p$th root of unity [$p$ a prime $\equiv 1$ $(l)$] has a class
number prime to $l$. On the other hand, a subfield of degree $l$ of the
field generated by the $p_1p_2$th roots of unity has always a class number
divisible by $l$ if $p_1 \equiv 1$ $(l)$, $p_2 \equiv 1$ $(l)$ are two different primes and if the
field is not contained in the field of the $p_1$th or the $p_2$th roots of unity.
The class number of such a field is not divisible by $l^2$ if one, at least, of
the two congruences


$$x^l \equiv p_1 \ (p_2), \qquad x^l \equiv p_2 \ (p_1)$$


has no rational solutions.

An example of such a case is $l = 3$, $p_1 = 7$, $p_2 = 13$. This means that
the class number of a cubic subfield of the field of the ninety-first root of
unity (which is not a subfield of the field of the seventh or thirteenth
root of unity) is divisible by 3 but not by 9. For one of these fields it
will now be shown that its class number is actually 3.

It can easily be checked that


$$f(x) = x^3 - 7 \cdot 13\,x + 3 \cdot 7 \cdot 13 = 0$$


has discriminant $11^2 \cdot 7^2 \cdot 13^2$ and that any of its roots $\theta$ defines a cyclic
cubic field whose discriminant is $7^2 \cdot 13^2$. From a refinement of
Minkowski's theory (see [19]; also [12], p. 452; for even sharper results,
see [20]) it follows that for a cyclic cubic field with discriminant $D$
there is in every ideal class an ideal $\mathfrak{a}$ such that

norma < %VD. 


In our case this gives norm $\mathfrak{a} < 20$. The prime numbers 3, 11, 19 split
up into three factors in the field, while 2, 5, 17 remain prime numbers.
It is therefore only necessary to examine in what classes the prime ideal
factors of 3, 11, 19 lie. Since the class number is divisible by 3 but not
by 9, only the class numbers 3, 6, 12, 15 come into question. The class
numbers 6 and 15 are impossible, since in such a case the 2-class group
or the 5-class group of the field would have to be cyclic. In this case
let $\mathfrak{p}$ be a prime ideal belonging to a class of order 2 or 5. Let, for
example, the 2-class group be cyclic. In this case we would have


$$\mathfrak{p}^s \sim \mathfrak{p}^a,$$


where $s$ is a generating automorphism of the Galois group of the field
and $a$ is a rational integer. Hence


$$\mathfrak{p}^{s^3} \sim \mathfrak{p}^{a^3}.$$


This implies $a^3 \equiv 1$ $(2)$, which implies $a \equiv 1$ $(2)$. This means that
$\mathfrak{p} \sim 1$. The same argument applies for the 5-class group.

In order to show that the class number 12 cannot occur, we prove
that the prime numbers 3, 11, 19 are norms of numbers or that their
third powers are. For this purpose we compute the norms of some numbers
$x + y\theta$ by means of the formula


$$\operatorname{norm}(x + y\theta) = x^3 - ax^2y + bxy^2 - cy^3$$


if $\theta^3 + a\theta^2 + b\theta + c = 0$.


We obtain

$$\operatorname{norm}(1 + \theta) = -3 \cdot 11^2,$$
$$\operatorname{norm}(2 - \theta) = 3^2 \cdot 11,$$
$$\operatorname{norm}(5 - \theta) = -3 \cdot 19,$$
$$\operatorname{norm}(3 - \theta) = 3^3.$$


These facts imply that the class number of the field is 3.
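
The arithmetic behind this example is easy to repeat by machine. The sketch below assumes the cubic $\theta^3 - 91\theta + 273 = 0$ of this section and evaluates the norm formula quoted above for the four numbers used; for comparison it also evaluates the general bound $(4/\pi)^{r_2}(n!/n^n)\sqrt{|d|}$ from earlier in the section (not the refinement cited from [19]) with $n = 3$, $r_2 = 0$, $d = 7^2 \cdot 13^2$. The function names are hypothetical.

```python
from math import factorial, pi, sqrt


def minkowski_bound(n, r2, d):
    """(4/pi)^r2 * n!/n^n * sqrt(|d|) for a field of degree n."""
    return (4 / pi) ** r2 * factorial(n) / n ** n * sqrt(abs(d))


def norm_linear(x, y, a, b, c):
    """Norm of x + y*theta when theta^3 + a*theta^2 + b*theta + c = 0."""
    return x**3 - a * x**2 * y + b * x * y**2 - c * y**3


a, b, c = 0, -91, 273                     # theta^3 - 91*theta + 273 = 0
bound = minkowski_bound(3, 0, 7**2 * 13**2)
print(bound)

norms = {(x, y): norm_linear(x, y, a, b, c)
         for x, y in [(1, 1), (2, -1), (5, -1), (3, -1)]}
for pair, value in norms.items():
    print(pair, value)
```

Only the primes 3, 11, and 19 occur in these norms, which is what the argument in the text requires.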

A treatment by rational methods is also possible for the classes, at
least in many cases [21-23]. If the field admits an integral base which
consists of the powers of a single number, then there is a one-to-one
correspondence between the ideal classes and the classes of $n \times n$
matrices $S^{-1}AS$, where $A$ is a fixed matrix with $f(A) = 0$. The elements
$a_{ik}$, $s_{ik}$ in $A = (a_{ik})$, $S = (s_{ik})$ are rational integers, and $S$ runs
through all matrices with $|S| = \pm 1$.

In complex quadratic fields the class number exceeds unity, apart
from a finite number of cases. This was conjectured by Gauss and
proved by Heilbronn [25]. Heilbronn also proves, with Linfoot [26],
that for $m > 163$ at most one further $m$ is possible such that the field
$F(\sqrt{-m})$ has class number unity. It is still an open question whether
there is a further $m$. Work by D. H. Lehmer [27] indicates that
probably no further $m$ exists. (For class numbers in noncyclic cubic
fields, see again [1, 9].)


16.10 Principal Idealization 


A rather complicated computation concerns the application of the
following famous theorem of Hilbert (see [28], Theorem 94, Zahlbericht).
Let $f$ be a field and $F$ a relatively cyclic extension of relative
degree $l$ of $f$ ($l$ is a prime number). Let all prime ideals of $f$ split up in
$F$ into different prime ideals. Then there exists an ideal in $f$ which is
not a principal ideal in $f$ but which is principal in $F$. Further, that
ideal in $f$ lies in a class of order $l$, and the class number of $f$ is divisible
by $l$.

If the class group of $f$ is cyclic, then there is no further problem, but if
the class group has at least two base classes, then the following problem
arises:

Given such an $f$ and $F$, which class of $f$ is the one that does go over into the
principal class in $F$?

If $f$ is a quadratic field $R(\sqrt{m})$ and $l = 2$, this is not too difficult.
However, if $l = 3$, the difficulties increase. In the first place, one has
to go a long way to find a field with a 3-class number > 9 and a noncyclic
class group. The first imaginary quadratic field with this
property is $F(\sqrt{-3{,}299})$. It has a 3-class number 27. A field with
a 3-class number 9 and a noncyclic class group is $F(\sqrt{-4{,}027})$. For
this problem, a rational method was also found to succeed (see [28, 29]).

This chapter is concerned with: (1) the Gauss-Markoff theorem,
(2) likelihood-ratio tests of linear hypothesis, (3) the distribution of
quadratic forms and of the variance ratio, and (4) applications. These
topics cover much of the theory in applied statistics and are the fundamental
theorems on which the widely used analysis of variance is based.*


THE GAUSS-MARKOFF THEOREM 


17.1 Preliminaries 


Definition 17.1. Let $y$ be a random variable having the distribution
function $F(y)$, that is,


(i) $0 \le F(y) \le 1$,

(ii) $F(-\infty) = 0$, $F(+\infty) = 1$,
(iii) $F(y)$ is nondecreasing,
(iv) $F(y)$ is continuous on the right.


The expected value of the random variable $y$ is defined by


$$E(y) = \mu = \int_{-\infty}^{\infty} y \, dF(y). \tag{17.1}$$


* This chapter was written when the author was at the National Bureau of 
Standards. 

† Added in Proof: Since this article has been written two excellent books have been
published which cover many of the topics discussed in this chapter. These are
F. A. Graybill, "An Introduction to Linear Statistical Models," vol. I, McGraw-Hill
Book Company, Inc., New York, 1961; and H. Scheffé, "The Analysis of Variance,"
John Wiley & Sons, Inc., New York, 1959.

Definition 17.2. The variance of $y$ is defined by


$$\sigma^2 = E(y - \mu)^2 = \int_{-\infty}^{\infty} (y - \mu)^2 \, dF(y). \tag{17.2}$$


Note that $\sigma^2 = E(y^2) - \mu^2$.

Definition 17.3. Let $y_i$ and $y_j$ be random variables having the joint
distribution $F(y_i, y_j)$. Then the covariance between $y_i$ and $y_j$ is defined
by

$$\sigma_{ij} = \operatorname{cov}(y_i, y_j) = E[(y_i - \mu_i)(y_j - \mu_j)]
= \int\!\!\int (y_i - \mu_i)(y_j - \mu_j) \, dF(y_i, y_j). \tag{17.3}$$


Note that $\sigma_{ij} = E(y_i y_j) - \mu_i\mu_j$.

Definition 17.4. Let $Y' = (y_1, y_2, \ldots, y_n)$ be a vector of random
variables and let $\mu' = (\mu_1, \mu_2, \ldots, \mu_n)$ where $\mu_i = E(y_i)$. Then we
write $\mu = E(Y)$. Also, let $\sigma_{ii} = \sigma_i^2$ and $V$ be the $n \times n$ matrix
$V = (\sigma_{ij})$. $V$ is termed the variance-covariance matrix of the vector
$Y$ and is defined by

$$\operatorname{var} Y = V = E[(Y - \mu)(Y - \mu)'] = E(YY') - \mu\mu'. \tag{17.4}$$


The following two lemmas will be useful in what follows.
Lemma 17.1. Let $C$ be an $r \times n$ matrix. Then


(i) $E(CY) = C\mu$,
(ii) $\operatorname{var} CY = CVC'$.


Proof. (i) follows directly from the definition of the expected value
operator. (ii) is proved as follows:


$$\operatorname{var} CY = E[(CY - C\mu)(CY - C\mu)'] = CE[(Y - \mu)(Y - \mu)']C' = CVC'.$$

An important case is that in which $C = (c_1, c_2, \ldots, c_n)$ is a $1 \times n$ vector
and $V = \sigma^2 I$. Then $\operatorname{var} CY = CC'\sigma^2 = \sigma^2 \sum_i c_i^2$. Note also from
(ii) that $E[(CY)(CY)'] = CVC' + (C\mu)(C\mu)'$.
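
Lemma 17.1 can be checked exactly on a small discrete distribution. The sketch below is only an illustration under stated assumptions: the four equally likely sample points and the matrix `C` are hypothetical, and exact rational arithmetic is used so that both sides can be compared without rounding.

```python
from fractions import Fraction


def matvec(C, y):
    """Product of the matrix C (list of rows) with the vector y."""
    return [sum(c * v for c, v in zip(row, y)) for row in C]


def mean_and_var(points):
    """Exact mean vector and variance-covariance matrix of equally likely points."""
    m = Fraction(len(points))
    n = len(points[0])
    mu = [sum(pt[i] for pt in points) / m for i in range(n)]
    V = [[sum((pt[i] - mu[i]) * (pt[j] - mu[j]) for pt in points) / m
          for j in range(n)] for i in range(n)]
    return mu, V


# Hypothetical toy distribution: Y takes each listed value with probability 1/4.
points = [(0, 1, 2), (1, 1, 0), (3, 0, 1), (2, 4, 1)]
C = [[1, 2, 0], [0, -1, 3]]               # an r x n matrix with r = 2, n = 3

mu, V = mean_and_var(points)
# Left-hand sides: mean and variance of CY computed directly from its values.
mu_c, V_c = mean_and_var([tuple(matvec(C, pt)) for pt in points])

# Right-hand sides of the lemma: C*mu and C*V*C'.
C_mu = matvec(C, mu)
CV = [[sum(C[i][k] * V[k][j] for k in range(3)) for j in range(3)] for i in range(2)]
CVCt = [[sum(CV[i][k] * C[j][k] for k in range(3)) for j in range(2)] for i in range(2)]

print(mu_c == C_mu, V_c == CVCt)
```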

Lemma 17.2. The expected value of the quadratic form $Y'AY$, where
$A = (a_{ij})$ is an $n \times n$ matrix, is $E(Y'AY) = \mu'A\mu + \operatorname{tr} AV$, where $\operatorname{tr} AV$
denotes the trace of $AV$.

Proof. Since $Y'AY = \sum_i a_{ii}y_i^2 + \sum\sum_{i \ne j} a_{ij}y_iy_j$, we have


$$E(Y'AY) = \sum_i a_{ii}(\sigma_{ii} + \mu_i^2) + \sum\sum_{i \ne j} a_{ij}(\sigma_{ij} + \mu_i\mu_j)
= \sum_{i,j} a_{ij}\mu_i\mu_j + \sum_{i,j} a_{ij}\sigma_{ij}
= \mu'A\mu + \operatorname{tr} AV.$$


17.2 The Gauss-Markoff Theorem


The method of least squares has been in use now for over 150 years.
It is now recognized that Gauss [4], in 1821 (collected works, 1873),
placed this method on a sound theoretical basis without any assumptions
that the random variables follow a normal distribution. Gauss'
contribution was somehow neglected until it was "rediscovered" by
Markoff [6]. For a more detailed historical introduction, see Plackett
[7].

Definition 17.5. A function $f(Y)$ of the random vector $Y$ is an
unbiased estimate of a parameter $\theta$ if $E[f(Y)] = \theta$. When such a
function $f(Y)$ exists, then $\theta$ is called an estimable parameter.

Definition 17.6. Let $l' = (l_1, l_2, \ldots, l_n)$ be a vector of constants.
Then the linear function $L = l'Y$ is called the minimum variance
unbiased estimate of $\theta$ if $E(L) = \theta$ and $\operatorname{var} L = \min$ (among the class
of all linear estimators). Sometimes this estimate is called the "best
estimate."

Theorem 17.1 (The Gauss-Markoff theorem). Let $X = (X_{i\alpha})$, $i = 1, 2, \ldots, p$; $\alpha = 1, 2, \ldots, n$ be a $p \times n$ ($p < n$) matrix of known constants
with rank $p$ and let $\beta' = (\beta_1, \beta_2, \ldots, \beta_p)$ be a vector of unknown parameters.
Consider the $1 \times n$ vector of random variables $Y' = (y_1, y_2, \ldots, y_n)$ such that


$$E(Y) = X'\beta, \qquad \operatorname{var} Y = \sigma^2 I. \tag{17.5}$$


Then the best estimate of any linear function $\theta = l'\beta$, $l' = (l_1, l_2, \ldots, l_p)$ is
obtained by substituting into $\theta$ the estimates $\hat\beta_i$ obtained by minimizing the sum
of squares $S = (Y - X'\beta)'(Y - X'\beta)$ with respect to each $\beta_i$. Furthermore,
the solutions of the $\hat\beta$ are obtained by solving the set of $p$ simultaneous equations


$$a\hat\beta = XY, \tag{17.6}$$


where $a = XX'$. [The set of equations (17.6) is usually termed the normal
equations.]

Proof. (i) Consider the linear function $\hat\theta = d'Y$, where $d$ is determined
by the conditions (17.5). Since $\hat\theta$ is to be unbiased, we have
$E(\hat\theta) = d'E(Y) = d'X'\beta = l'\beta$, which results in

$$d'X' = l' \quad\text{or}\quad Xd = l. \tag{17.7}$$

Using Lemma 17.1, $\operatorname{var}\hat\theta = d'd\sigma^2$. Thus we wish to minimize $d'd$
subject to the condition (17.7). Using the method of Lagrange
multipliers, let $Q = d'd - 2\lambda'(Xd - l)$, where $\lambda' = (\lambda_1, \lambda_2, \ldots, \lambda_p)$.
Define the operator $\partial/\partial d$ by


$$\frac{\partial}{\partial d} = \left(\frac{\partial}{\partial d_1}, \frac{\partial}{\partial d_2}, \ldots, \frac{\partial}{\partial d_n}\right).$$

Then a necessary condition for $Q$ to be a minimum is that $\partial Q/\partial d = 0$.
This results in


$$\frac{\partial Q}{\partial d} = 2d' - 2\lambda'X = 0$$


or


$$d = X'\lambda. \tag{17.8}$$


Premultiplying (17.8) by $X$ results in $Xd = a\lambda = l$, and thus, since
$a = XX'$ is not singular,

$$\lambda = a^{-1}l. \tag{17.9}$$


Substituting (17.9) in (17.8) results in


$$d = X'a^{-1}l, \tag{17.10}$$

and since $\hat\theta = d'Y$, we have


$$\hat\theta = l'a^{-1}XY. \tag{17.11}$$

However, $E(\hat\theta) = l'a^{-1}XE(Y) = l'\beta$. Therefore we can define $\hat\beta$ by

$$\hat\beta = a^{-1}XY, \tag{17.12}$$


and $\hat\beta$ will be an unbiased estimate of $\beta$. Premultiplying (17.12) by $a$
results in the normal equations.

(ii) It remains to show that the same result is obtained if $S = (Y - X'\beta)'(Y - X'\beta)$ is minimized with respect to $\beta$. The necessary
condition for $S$ to be a minimum is that $\partial S/\partial\beta = 0$. Carrying out the
differentiation results in

$$\frac{\partial S}{\partial\beta} = -2X(Y - X'\beta) = 0,$$

which produces the normal equations. One can easily show that the
solution of the normal equations results in minimum $S$ by virtue of the
identity


$$S = (Y - X'\hat\beta)'(Y - X'\hat\beta) + \delta'a\delta \ge (Y - X'\hat\beta)'(Y - X'\hat\beta),$$


where $\beta = \hat\beta + \delta$.


Corollary 1.

$$\operatorname{var}\hat\beta = a^{-1}\sigma^2.$$


Proof. Since $\hat\beta = a^{-1}XY$, we have, using Lemma 17.1,


$$\operatorname{var}\hat\beta = \operatorname{var}(a^{-1}XY) = (a^{-1}X)\operatorname{var} Y\,(a^{-1}X)' = (a^{-1}X)(\sigma^2 I)(X'a^{-1}) = a^{-1}\sigma^2.$$

Corollary 2. The quadratic form

$$s^2 = \frac{(Y - X'\hat\beta)'(Y - X'\hat\beta)}{n - p} \tag{17.13}$$


is an unbiased estimate of $\sigma^2$; that is, $E(s^2) = \sigma^2$.
Proof. We can write


$$(Y - X'\hat\beta)'(Y - X'\hat\beta) = Y'Y - \hat\beta'a\hat\beta.$$


Therefore

$$E[(Y - X'\hat\beta)'(Y - X'\hat\beta)] = E(Y'Y) - E(\hat\beta'a\hat\beta).$$

Using Lemma 17.2,

$$E[(Y - X'\hat\beta)'(Y - X'\hat\beta)] = [(X'\beta)'(X'\beta) + n\sigma^2] - (\beta'a\beta + p\sigma^2) = (n - p)\sigma^2.$$

Definition 17.7. The quantity $v_\alpha = y_\alpha - \sum_{i=1}^{p} \hat\beta_i X_{i\alpha}$ is defined to be
the residual of the $\alpha$th observation ($\alpha = 1, 2, \ldots, n$). The vector $v = (Y - X'\hat\beta)$ is termed the residual vector. Since $Xv = X(Y - X'\hat\beta) = 0$,
there are $p$ linear relations among the $n$ residuals. Hence only $n - p$
of the residuals are independent. Since the quantity $s^2$ can be written


$$s^2 = \frac{v'v}{n - p}, \tag{17.14}$$


the estimate $s^2$ is said to have $(n - p)$ degrees of freedom, as it can be
written as a sum of squares involving only $(n - p)$ variables.
Remark. An identity which is often used for computing $v'v = (Y - X'\hat\beta)'(Y - X'\hat\beta)$ is

$$v'v = Y'Y - \hat\beta'XY. \tag{17.15}$$


Proof.

$$v'v = v'(Y - X'\hat\beta) = v'Y = (Y - X'\hat\beta)'Y = Y'Y - \hat\beta'XY \quad\text{as } v'X' = 0.$$
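
The normal equations, the residuals, and the identity (17.15) can be followed on a small numerical case. The sketch below is an illustration only: the data (a straight line fitted at four points) are hypothetical, the helper `solve` is a plain Gauss-Jordan elimination in exact rational arithmetic, and it assumes that the matrix $a = XX'$ is nonsingular.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination with row swaps.

    Raises StopIteration if A is singular (no nonzero pivot can be found).
    """
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]


# Hypothetical data: the line y = b1 + b2*t observed at t = 0, 1, 2, 3.
X = [[1, 1, 1, 1], [0, 1, 2, 3]]          # p x n matrix of known constants
Y = [1, 3, 5, 8]
p, n = len(X), len(Y)

a = [[sum(xi * xj for xi, xj in zip(X[i], X[j])) for j in range(p)] for i in range(p)]
XY = [sum(x * y for x, y in zip(row, Y)) for row in X]
beta = solve(a, XY)                        # normal equations (17.6)

v = [Y[k] - sum(beta[i] * X[i][k] for i in range(p)) for k in range(n)]
Xv = [sum(x * r for x, r in zip(row, v)) for row in X]   # should vanish
vv = sum(r * r for r in v)                 # residual sum of squares v'v
s2 = vv / (n - p)                          # estimate (17.14)
identity = sum(y * y for y in Y) - sum(bi * t for bi, t in zip(beta, XY))  # (17.15)
print(beta, Xv, vv, s2, vv == identity)
```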


Corollary 3. The residuals $v$ are uncorrelated with $\hat\beta$; that is, $\operatorname{cov}(v, \hat\beta) = 0$;
hence $s^2$ is uncorrelated with $\hat\beta$.
Proof. Note that $E(v) = E(Y - X'\hat\beta) = 0$. Therefore $\operatorname{cov}(v, \hat\beta) = E[v(\hat\beta - \beta)'] = E(v\hat\beta')$ as $E(v\beta') = 0$. Since $v = Y - X'\hat\beta = (I - X'a^{-1}X)Y$, we have

$$E(v\hat\beta') = E[(I - X'a^{-1}X)Y(Y'X'a^{-1})]
= (I - X'a^{-1}X)(X'\beta\beta'X + \sigma^2 I)(X'a^{-1})$$
$$= (X'\beta\beta'X - X'a^{-1}XX'\beta\beta'X)X'a^{-1} + \sigma^2(X'a^{-1} - X'a^{-1}XX'a^{-1}) = 0.$$

Theorem 17.2. Let $E(Y) = X'\beta$ and $\operatorname{var} Y = V\sigma^2$, where $V$ is positive
definite. Then the normal equations for estimating $\beta$ are $a^*\hat\beta = XV^{-1}Y$,
where $a^* = XV^{-1}X'$.

Proof. Since $V$ is real, symmetric, and positive definite, there exists
a matrix $\Gamma$ such that $V = \Gamma\Gamma'$, where $\Gamma$ is real and nonsingular.
Consider the transformation $Y^* = \Gamma^{-1}Y$, $X^* = X\Gamma'^{-1}$. Then $\operatorname{var} Y^* = \Gamma^{-1}(V\sigma^2)\Gamma'^{-1} = I\sigma^2$ and $E(Y^*) = \Gamma^{-1}X'\beta = X^{*\prime}\beta$. Thus the conditions
of the Gauss-Markoff theorem are satisfied by $Y^*$, and the normal
equations are $a^*\hat\beta = X^*Y^*$, where $a^* = X^*X^{*\prime} = XV^{-1}X'$ and $X^*Y^* = XV^{-1}Y$.

Corollary 1. The situation often arises where $V$ can be written as the
diagonal matrix

$$V = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2 \end{pmatrix}.$$


Then $\Gamma = D(\sigma_1, \sigma_2, \ldots, \sigma_n)$ and $Y^* = D^{-1}(\sigma_1, \sigma_2, \ldots, \sigma_n)Y$, $X^* = XD^{-1}(\sigma_1, \sigma_2, \ldots, \sigma_n)$.

Theorem 17.3. Let $E(Y) = X'\beta$, $\operatorname{var} Y = \sigma^2 I$, and let the $\beta$ satisfy the
restrictions $K'\beta = m$, where $K'$ is an $r \times p$ matrix ($r < p$) of rank $r$ which is
known and $m' = (m_1, m_2, \ldots, m_r)$ is a vector of known constants. Then
the normal equations for estimating $\beta$ take the form


$$\begin{pmatrix} a & K \\ K' & 0 \end{pmatrix} \begin{pmatrix} \hat\beta \\ \lambda \end{pmatrix} = \begin{pmatrix} XY \\ m \end{pmatrix}.$$

Proof. Our problem is to minimize $(Y - X'\beta)'(Y - X'\beta)$ subject to
the $r$ linear restrictions $K'\beta = m$. Let $\lambda' = (\lambda_1, \lambda_2, \ldots, \lambda_r)$ be a vector
of Lagrange multipliers and form the function $Q = (Y - X'\beta)'(Y - X'\beta) + 2\lambda'(K'\beta - m)$. Differentiating $Q$ with respect to $\beta$ and setting
the derivative equal to zero results in


$$\frac{\partial Q}{\partial\beta} = -2(Y - X'\beta)'X' + 2\lambda'K' = 0,$$


which can be written, after taking the transpose, as

$$a\hat\beta + K\lambda = XY. \tag{17.16}$$


Therefore (17.16), together with $K'\hat\beta = m$, is a basis for finding the
estimate of $\beta$.

Corollary. Define

$$\begin{pmatrix} c_1 & c_2 \\ c_2' & c_3 \end{pmatrix} = \begin{pmatrix} a & K \\ K' & 0 \end{pmatrix}^{-1}.$$

Then
(i) $c_1ac_1 = c_1$,
(ii) $\operatorname{var}\hat\beta = c_1\sigma^2$,
(iii) $E[(Y - X'\hat\beta)'(Y - X'\hat\beta)] = (n - p + r)\sigma^2$.


Proof. Multiplying $\begin{pmatrix} a & K \\ K' & 0 \end{pmatrix}$ by its inverse results in the four equations


$$ac_1 + Kc_2' = I, \tag{17.17a}$$
$$K'c_1 = 0, \tag{17.17b}$$
$$ac_2 + Kc_3 = 0, \tag{17.17c}$$
$$K'c_2 = I. \tag{17.17d}$$


(i) Premultiplying (17.17a) by $c_1$ gives

$$c_1ac_1 + c_1Kc_2' = c_1ac_1 = c_1,$$

as $c_1K = 0$ from (17.17b).

(ii) The solutions for $\hat\beta$ can be written as

$$\hat\beta = c_1XY + c_2m.$$

Now, $\operatorname{var}\hat\beta = \operatorname{var}(c_1XY)$, as $c_2m$ is constant; therefore

$$\operatorname{var}(c_1XY) = (c_1X)\operatorname{var} Y\,(c_1X)' = (c_1XX'c_1)\sigma^2 = c_1\sigma^2.$$

(iii) We can write the residual sum of squares as

$$S = (Y - X'\hat\beta)'(Y - X'\hat\beta) = (Y - X'\hat\beta_0)'(Y - X'\hat\beta_0) + \hat\lambda'W\hat\lambda,$$

where

$$\hat\beta_0 = a^{-1}XY, \qquad W = K'a^{-1}K, \qquad \hat\lambda = W^{-1}(K'\hat\beta_0 - m).$$

Note that $\hat\beta_0 = a^{-1}XY$ is the estimate of $\beta$ ignoring the restraint
$K'\beta = m$. Furthermore, the quantity $\hat\lambda$ is proportional to the deviations
of $K'\hat\beta_0$ from $m$. Taking the expectation of $\hat\lambda$, we have $E(\hat\lambda) = W^{-1}(K'\beta - m) = 0$, and thus

$$\operatorname{var}\hat\lambda = W^{-1}K'(\operatorname{var}\hat\beta_0)KW^{-1} = \sigma^2W^{-1}(K'a^{-1}K)W^{-1} = \sigma^2W^{-1}.$$

Since $E[(Y - X'\hat\beta_0)'(Y - X'\hat\beta_0)] = (n - p)\sigma^2$, we find that the
expected value of $S$ is

$$E(S) = (n - p)\sigma^2 + \operatorname{tr} W(\operatorname{var}\hat\lambda) = (n - p + r)\sigma^2, \tag{17.18}$$

as $\operatorname{tr} W(\operatorname{var}\hat\lambda) = \sigma^2 \operatorname{tr} WW^{-1} = r\sigma^2$.

Using the relations (17.17a) to (17.17d), it is easy to show that $c_1$ can
be written

$$c_1 = a^{-1}[I - K(K'a^{-1}K)^{-1}K'a^{-1}] \tag{17.19}$$


provided $a$ is not singular. However, if $a$ is singular, we can write $a$ as

$$a = a_0 + DKK', \tag{17.20}$$


where $a_0$ is nonsingular and $D$ is a nonsingular arbitrary diagonal
matrix. Then $c_1$ can be written as in (17.19), except that $a_0^{-1}$ replaces
$a^{-1}$. From a practical point of view, it is often not convenient to solve
for the $\hat\beta$ using the $(r + p)$ linear equations of Theorem 17.3. Instead,
one can solve the set of $p$ linear equations


$$a_0\hat\beta = XY - DKm, \tag{17.21}$$


where the diagonal matrix $D$ is chosen in (17.20), so as to make $a_0$ a
convenient matrix for inversion.
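
For a nonsingular $a$, the bordered system of Theorem 17.3 and the two-step route through $\hat\beta_0$, $W$, and $\hat\lambda$ can be compared on a small case. The sketch below is an illustration under stated assumptions: the data and the single restriction $\beta_1 = 0$ are hypothetical, `solve` is a plain exact Gauss-Jordan elimination, and only the case $r = 1$ is treated so that $W$ is a scalar.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination with row swaps."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]


def rss(X, Y, beta):
    """Residual sum of squares (Y - X'beta)'(Y - X'beta)."""
    p = len(X)
    return sum((Y[k] - sum(beta[i] * X[i][k] for i in range(p))) ** 2
               for k in range(len(Y)))


# Hypothetical data and restriction: line through four points, forced to pass
# through the origin (K'beta = beta_1 = 0).
X = [[1, 1, 1, 1], [0, 1, 2, 3]]
Y = [1, 3, 5, 8]
K = [1, 0]                                 # the single column of the p x 1 matrix K
m = 0
p = len(X)

a = [[sum(u * w for u, w in zip(X[i], X[j])) for j in range(p)] for i in range(p)]
XY = [sum(x * y for x, y in zip(row, Y)) for row in X]

# Bordered system of Theorem 17.3: [[a, K], [K', 0]] [beta; lambda] = [XY; m].
bordered = [a[i] + [K[i]] for i in range(p)] + [K + [0]]
*beta_r, lam = solve(bordered, XY + [m])

# Two-step route: unrestricted estimate, then the correction by lambda.
beta0 = solve(a, XY)
a_inv_K = solve(a, K)                      # a^{-1} K
W = sum(k * t for k, t in zip(K, a_inv_K)) # W = K' a^{-1} K (a scalar here)
lam2 = (sum(k * b for k, b in zip(K, beta0)) - m) / W
beta2 = [b - t * lam2 for b, t in zip(beta0, a_inv_K)]

print(beta_r, lam, beta2 == beta_r)
print(rss(X, Y, beta_r) == rss(X, Y, beta0) + lam2 * W * lam2)
```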

In many applications, the matrix of coefficients $a = XX'$ of the normal
equations will be singular and have rank $p - r$. Hence the solutions
for the $\hat\beta$ will not be unique. This implies that some linear
functions of $\beta$ will not be estimable. It is then convenient to let the $\hat\beta$
satisfy $r$ arbitrary linearly independent restraints, chosen such that the
restraints are nonestimable functions of $\beta$, and use the setup of
Theorem 17.3.


LIKELIHOOD RATIO TESTS OF LINEAR HYPOTHESIS


17.3 Maximum Likelihood Estimates for Normal
Distributions
Definition 17.8. The (one-dimensional) random variable $y$ is said to
have a normal distribution with mean $\mu$ and variance $\sigma^2$ if its probability
density function is of the form


$$f(y;\mu,\sigma^2) = (\sigma^2 2\pi)^{-1/2} \exp\left[-\frac{1}{2\sigma^2}(y - \mu)^2\right], \qquad -\infty < y < \infty.$$

If $Y' = (y_1, \ldots, y_n)$ follow a normal distribution with $E(Y) = X'\beta$
and $\operatorname{var}(Y) = \sigma^2 I$, then the joint probability density function of $Y$ is


$$f(Y;\beta,\sigma^2) = (\sigma^2 2\pi)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}(Y - X'\beta)'(Y - X'\beta)\right]. \tag{17.22}$$


Also any linear function of normally distributed variables will itself
follow a normal distribution.

Definition 17.9. The maximum likelihood estimates of the unknown
parameters $\beta' = (\beta_1, \beta_2, \ldots, \beta_p)$ and $\sigma^2$ are those estimates which maximize
$f(Y;\beta,\sigma^2)$.

Theorem 17.4. The maximum likelihood estimates for $\beta$ and $\sigma^2$ are
(i) the same estimates for $\beta$ as obtained from the Gauss-Markoff theorem,
(ii) $\hat\sigma^2 = (Y - X'\hat\beta)'(Y - X'\hat\beta)/n$.

Proof. Since $L = \ln f(Y;\beta,\sigma^2)$ is a monotonic function of $f(Y;\beta,\sigma^2)$,
it is sufficient to maximize $L$. Thus


$$L = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(Y - X'\beta)'(Y - X'\beta),$$

and

$$\frac{\partial L}{\partial\beta} = \frac{1}{\sigma^2}[X(Y - X'\beta)] = 0, \tag{17.23}$$

$$\frac{\partial L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(Y - X'\beta)'(Y - X'\beta) = 0. \tag{17.24}$$


It is clear that (17.23) results in the normal equations and that simplifying
(17.24) and substituting $\hat\beta$ for $\beta$ give


$$\hat\sigma^2 = \frac{(Y - X'\hat\beta)'(Y - X'\hat\beta)}{n}.$$


Corollary. The value of $f(Y;\beta,\sigma^2)$ obtained by substituting the maximum
likelihood estimates for $\beta$ and $\sigma^2$ is


$$f(Y;\hat\beta,\hat\sigma^2) = [\hat\sigma^2(2\pi)]^{-n/2} \exp\left(-\frac{n}{2}\right).$$


17.4 Likelihood-ratio Tests


Let $Y' = (y_1, y_2, \ldots, y_n)$ be a vector of random variables which
follow the joint distribution $P(Y;\theta)$, where $\theta' = (\theta_1, \theta_2, \ldots, \theta_k)$ denotes
a vector of parameters.* Let $\Omega$ be the $k$-dimensional euclidean space
of admissible parameter values for $\theta$ and let $\omega$ be a subspace of $\Omega$.
Often statisticians are interested in testing the hypothesis that the
parameters $\theta$ lie in the subspace $\omega$. This is often referred to as a null
hypothesis and is written $H_0$: $\theta \in \omega$. The alternative to $H_0$ is that
$\theta \in \Omega - \omega$ and is often written $H_1$: $\theta \in \Omega - \omega$.

Definition 17.10. Consider a test of the null hypothesis $H_0$: $\theta \in \omega$.
Define

$$P_0(Y;\hat\theta_0) = \max_{\theta \in \omega} P(Y;\theta),$$


$$P_1(Y;\hat\theta_1) = \max_{\theta \in \Omega} P(Y;\theta).$$


Thus $P_0(Y;\hat\theta_0)$ and $P_1(Y;\hat\theta_1)$ are functions only of $Y$. Also, $\hat\theta_0$ is by
definition the maximum likelihood estimate for $\theta$ allowing $\theta$ to take on
values within $\omega$, and $\hat\theta_1$ is the maximum likelihood estimate of $\theta$
allowing $\theta$ to take on values over the entire space $\Omega$. Then the likelihood
ratio is defined by

$$\lambda = \frac{P_0(Y;\hat\theta_0)}{P_1(Y;\hat\theta_1)}.$$


Note that $0 \le \lambda \le 1$ because $P_0(Y;\hat\theta_0) \le P_1(Y;\hat\theta_1)$. Note also that, if
$H_0$ is a correct hypothesis, $\lambda$ will be near unity, and the closer $\lambda$ is to
unity, the more belief we have in the correctness of $H_0$.

Since $\lambda$ depends only on the observations $Y$, it may be possible to
find the distribution of $\lambda$ and choose a $\lambda = \lambda_\alpha$, say, such that


$$\Pr\{\lambda \le \lambda_\alpha \mid H_0 \text{ is true}\} = \alpha \qquad (0 < \alpha < 1).$$


Then, to determine whether $H_0$ or $H_1$ is supported by $Y$, we can adopt
the rule of accepting $H_0$ if $\lambda > \lambda_\alpha$ and of accepting $H_1$ if $\lambda \le \lambda_\alpha$. Adopting
such a rule involves the probability $\alpha$ that we shall accept $H_1$ even
though $H_0$ is the correct hypothesis. The quantity $\alpha$ is termed the level
of significance of the statistical test and is often referred to as the type I
error. On the other hand, we might accept $H_0$ when $H_1$ is correct.
The probability of an error of this kind is referred to as the type II error.


* $P(Y;\theta)$ denotes a probability density function if the distribution is of the
continuous type; if the distribution is of the discrete type, then $P(Y;\theta)$ denotes the
discrete probability of $Y$.


17.5 Likelihood-ratio Tests of General Linear Hypothesis


Let $E(Y) = X'\beta$ and $\operatorname{var} Y = \sigma^2 I$ and assume that $Y$ follows a normal
distribution. Define $\beta' = (\beta_0', \beta_1')$ where $\beta_0$ is a $(p - r) \times 1$ vector and $\beta_1$
is an $r \times 1$ vector; $X' = (X_0', X_1')$, where $X_0$ is $(p - r) \times n$ and $X_1$ is
$r \times n$. Then $E(Y) = X_0'\beta_0 + X_1'\beta_1$. Let us consider the null hypothesis
that

$$H_0\colon\ \beta_1 = 0, \qquad -\infty < \beta_0 < +\infty$$


versus the alternative that

$$H_1\colon\ \beta_1 \ne 0, \qquad -\infty < \beta_0 < +\infty.$$

By $\beta_1 = 0$ we mean that each element of $\beta_1$ is hypothesized to be equal
to zero. The expression $-\infty < \beta_0 < \infty$ indicates that the elements of
$\beta_0$ can take any value on the real line. The likelihood-ratio test will be
derived for testing this null hypothesis.

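Theorem 17.5. The likelihood-ratio test for


$$H_0\colon\ \beta_1 = 0, \qquad -\infty < \beta_0 < \infty$$

is a monotonic function of

$$F = \frac{q_1/r}{q_0/(n - p)},$$


where

$$q_1 = (\hat\beta_0 - \tilde\beta_0)'X_0Y + \hat\beta_1'X_1Y,$$
$$q_0 = (Y - X'\hat\beta)'(Y - X'\hat\beta) = Y'Y - \hat\beta'XY,$$


and where $\tilde\beta_0$ and $\hat\beta' = (\hat\beta_0', \hat\beta_1')$ are solutions of

$$(X_0X_0')\tilde\beta_0 = X_0Y$$


and


$$a\hat\beta = XY, \qquad a = XX',$$


respectively. The quantity $F$ is termed the variance ratio or $F$ statistic.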
Proof. The likelihood ratio can be written, by virtue of the corollary
to Theorem 17.4, as

$$\lambda = \frac{f(Y;\tilde\beta_0,\tilde\sigma^2)}{f(Y;\hat\beta,\hat\sigma^2)} = \left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right)^{n/2}, \tag{17.25}$$

where $\tilde\sigma^2$ denotes the estimate of $\sigma^2$ if $H_0$ is the true hypothesis and $\hat\sigma^2$
denotes the estimate of $\sigma^2$ allowing $\sigma^2$ and $\beta$ to take on values over the
entire parameter space.

If $H_0$ is true, $E(Y) = X_0'\beta_0$, and using the Gauss-Markoff theorem
results in


$$(X_0X_0')\tilde\beta_0 = X_0Y.$$

The maximum likelihood estimate for $\sigma^2$ is


$$\tilde\sigma^2 = \frac{(Y - X_0'\tilde\beta_0)'(Y - X_0'\tilde\beta_0)}{n}.$$


On the other hand, if $H_0$ is not a correct hypothesis,

$$E(Y) = X_0'\beta_0 + X_1'\beta_1 = X'\beta$$


and $a\hat\beta = XY$, $a = XX'$.
Here the estimate is

$$\hat\sigma^2 = \frac{(Y - X'\hat\beta)'(Y - X'\hat\beta)}{n}.$$


It suffices, in considering likelihood-ratio tests, to consider any
monotonic function of the likelihood ratio as the test criterion. In
particular, let us take


$$F = \frac{n - p}{r}\left(\lambda^{-2/n} - 1\right) = \frac{n - p}{r} \cdot \frac{\tilde\sigma^2 - \hat\sigma^2}{\hat\sigma^2} \tag{17.26}$$


as the test criterion. Substituting the values for $\tilde\sigma^2$ and $\hat\sigma^2$ in (17.26)
gives the desired result.


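Corollary. The quadratic form $q_1$ appearing in the numerator of $F$ can
be written as $q_1 = \hat\beta_1'C_{11}^{-1}\hat\beta_1$, provided $C_{11}$ is nonsingular.


Proof. If we define

$$\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix} = \begin{pmatrix} X_0X_0' & X_0X_1' \\ X_1X_0' & X_1X_1' \end{pmatrix}^{-1}, \qquad C_{01}' = C_{10},$$

then

$$\hat\beta_0 = C_{00}X_0Y + C_{01}X_1Y, \qquad \hat\beta_1 = C_{10}X_0Y + C_{11}X_1Y.$$

Therefore $q_1 = (\hat\beta_0 - \tilde\beta_0)'X_0Y + \hat\beta_1'X_1Y$ can be written as

$$q_1 = Y'[X_0'C_{00}X_0 + X_0'C_{01}X_1 + X_1'C_{10}X_0 + X_1'C_{11}X_1 - X_0'(X_0X_0')^{-1}X_0]Y.$$

Now

$$\hat\beta_1'C_{11}^{-1}\hat\beta_1 = Y'(X_0'C_{01}C_{11}^{-1}C_{10}X_0 + X_0'C_{01}X_1 + X_1'C_{10}X_0 + X_1'C_{11}X_1)Y,$$

and thus $q_1 = \hat\beta_1'C_{11}^{-1}\hat\beta_1$ if

$$X_0'[C_{00} - (X_0X_0')^{-1}]X_0 = X_0'C_{01}C_{11}^{-1}C_{10}X_0.$$

Since

$$(X_0X_0')C_{00} + (X_0X_1')C_{10} = I, \tag{17.27}$$
$$(X_0X_0')C_{01} + (X_0X_1')C_{11} = 0, \tag{17.28}$$

we can premultiply (17.27) by $(X_0X_0')^{-1}$, and premultiply (17.28) by
$(X_0X_0')^{-1}$ and postmultiply it by $C_{11}^{-1}$, to obtain

$$C_{00} - (X_0X_0')^{-1} = -(X_0X_0')^{-1}(X_0X_1')C_{10} \tag{17.29}$$

and

$$C_{01}C_{11}^{-1} = -(X_0X_0')^{-1}(X_0X_1'), \tag{17.30}$$

respectively. Substituting (17.30) in (17.29) results in
$C_{00} - (X_0X_0')^{-1} = C_{01}C_{11}^{-1}C_{10}$, which proves the corollary.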
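
The quantities of Theorem 17.5 and its corollary can be checked numerically on a small example. The following is a minimal sketch in Python, assuming exact rational arithmetic and a tiny made-up data set; the helper names `solve`, `dot`, and `variance_ratio` are illustrative and not part of the text. It computes $q_1$ both from its definition and from the corollary form $\hat\beta_1'C_{11}^{-1}\hat\beta_1$.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination, exactly, using Fractions."""
    n = len(a)
    m = [[Fraction(v) for v in a[i]] + [Fraction(b[i])] for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if m[i][c] != 0)   # first usable pivot row
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for i in range(n):
            if i != c:
                f = m[i][c]
                m[i] = [vi - f * vc for vi, vc in zip(m[i], m[c])]
    return [row[n] for row in m]

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

def variance_ratio(x0, x1, y):
    """Return (q1, q2, F, q1 via the corollary) for H0: beta_1 = 0."""
    x = x0 + x1                                   # rows of X: first X0, then X1
    n, p, r = len(y), len(x), len(x1)
    a = [[dot(u, v) for v in x] for u in x]       # a = X X'
    xy = [dot(u, y) for u in x]                   # X Y
    beta_hat = solve(a, xy)                       # a beta_hat = X Y
    a00 = [row[:p - r] for row in a[:p - r]]      # X0 X0'
    beta_tilde = solve(a00, xy[:p - r])           # (X0 X0') beta_tilde_0 = X0 Y
    q1 = (dot(beta_hat[:p - r], xy[:p - r]) - dot(beta_tilde, xy[:p - r])
          + dot(beta_hat[p - r:], xy[p - r:]))
    q2 = dot(y, y) - dot(beta_hat, xy)
    # Corollary: C11 is the lower-right r x r block of the inverse of a.
    inv_cols = [solve(a, [int(i == j) for i in range(p)]) for j in range(p)]
    c11 = [[inv_cols[j][i] for j in range(p - r, p)] for i in range(p - r, p)]
    b1 = beta_hat[p - r:]
    q1_corollary = dot(b1, solve(c11, b1))        # beta_1' C11^{-1} beta_1
    return q1, q2, (q1 / r) / (q2 / (n - p)), q1_corollary

x0 = [[1, 1, 1, 1, 1]]      # hypothetical data: constant term
x1 = [[1, 2, 3, 4, 5]]      # one regressor, deliberately not centred
y = [1, 3, 2, 5, 4]
q1, q2, F, q1_cor = variance_ratio(x0, x1, y)
print(q1, q2, F, q1_cor)
```

Because the regressor is not centred, $X_0X_1' \ne 0$ here, so the example exercises the general case rather than the orthogonal case of Remark 1.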
Remark 1. Note that the normal equations $a\hat\beta = XY$ can be written

$$\begin{pmatrix} X_0X_0' & X_0X_1' \\ X_1X_0' & X_1X_1' \end{pmatrix}\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} X_0Y \\ X_1Y \end{pmatrix}$$

and that, if $X_0X_1' = 0$, then $\hat\beta_0$ is obtained from the solution of

$$(X_0X_0')\hat\beta_0 = X_0Y.$$

Thus $\hat\beta_0 = \tilde\beta_0$, and $q_1$ can be written as $q_1 = \hat\beta_1'X_1Y$. In this case, $\beta_0$
and $\beta_1$ are said to be mutually orthogonal.

Remark 2. Instead of testing the null hypothesis that a subset of the
$\beta$'s is equal to zero ($\beta_1 = 0$), we may wish to test whether all $\beta$'s are
equal to zero ($\beta = 0$). Then the results of Theorem 17.5 and its
corollary immediately follow by setting $r = p$, modifying the formulas
so that terms involving a zero subscript are omitted, and removing the
subscript 1 from the remaining terms; for example,

$$q_1 = \hat\beta'XY = \hat\beta'a\hat\beta,$$
$$q_2 = (Y - X'\hat\beta)'(Y - X'\hat\beta).$$


Remark 3. Instead of considering the null hypothesis $H_0\colon \beta_1 = 0$,
we may wish to consider the null hypothesis $H_0\colon \beta_1 = \beta_1^0$, that is, to test
the hypothesis that the $r$ parameters $\beta_1$ are equal to a vector of given
constants $\beta_1^0$, where $\beta_1^0$ is not necessarily zero for each element. Since

$$E(Y) = X_0'\beta_0 + X_1'\beta_1 = X_0'\beta_0 + X_1'(\beta_1 - \beta_1^0) + X_1'\beta_1^0,$$

we can introduce the transformations

$$Y^* = Y - X_1'\beta_1^0, \qquad \beta_0^* = \beta_0, \qquad \beta_1^* = \beta_1 - \beta_1^0$$

and have

$$E(Y^*) = X_0'\beta_0^* + X_1'\beta_1^*.$$

Therefore a test of $H_0\colon \beta_1 = \beta_1^0$ goes over into a test of $H_0\colon \beta_1^* = 0$, and
the results of Theorem 17.5 follow, with $q_1 = (\hat\beta_1 - \beta_1^0)'C_{11}^{-1}(\hat\beta_1 - \beta_1^0)$.

Remark 4. Often we have the setup that $E(Y) = X_0'\beta_0 + X_1'\beta_1$, such
that $\beta_0$ and $\beta_1$ satisfy the equations

$$K_0'\beta_0 = m_0,$$
$$K_1'\beta_1 = m_1,$$

where $K_0'$, $K_1'$ are matrices of dimension $r_0 \times (p - r)$ ($r_0 < p - r$) and
$r_1 \times r$ ($r_1 < r$), respectively, and $m_0$, $m_1$ are column vectors of length
$r_0$ and $r_1$. We wish to test the null hypothesis $H_0\colon \beta_1 = \beta_1^0$, where $m_1$
is such that $K_1'\beta_1^0 = m_1$. The variance-ratio criterion will then take the
form

$$F = \frac{(\tilde\sigma^2 - \hat\sigma^2)/(r - r_1)}{\hat\sigma^2/(n - p + r_0 + r_1)}.$$

Remark 5. Sometimes the null hypothesis takes the form of specifying
that, say, $r$ linear functions of the $\beta_i$ are equal to some given constants.
Let $K'$ be an $r \times p$ matrix and $m$ an $r \times 1$ vector. Then the null
hypothesis is that $K'\beta = m$. In this case the variance ratio takes the
form

$$F = \frac{(\tilde\sigma^2 - \hat\sigma^2)/r}{\hat\sigma^2/(n - p)},$$

where $n\tilde\sigma^2 = (Y - X'\tilde\beta)'(Y - X'\tilde\beta)$ and the $\tilde\beta$ are obtained by using
Theorem 17.3.


REFERENCES 


The method of maximum likelihood was developed by R. A. Fisher [5-9], 
although Gauss had earlier applied the method to several special cases. This method 
of estimation was subsequently investigated by a series of workers, of whom Doob 
[2, 3], Dugué [4], Hotelling [11], and Wald [20] should be mentioned. The book by
Cramér [1] summarizes this research. LeCam [14] gives a historical introduction 
as well as a summary of the more recent work on maximum likelihood. 

Likelihood-ratio tests were proposed by Neyman and Pearson [16] in 1928. In 
later papers [17, 18], these investigators developed a systematic approach to the prob- 
lem of testing hypotheses. It was found in nearly all cases that the likelihood-ratio 
test turns out to be the “best” test. Wilks [21] and Wald [19] wrote basic papers on
the “optimum” properties of the likelihood-ratio test.

The general linear hypothesis was developed by Kolodziejczyk [13]. Other 
expositions can be found in Kempthorne [12], Mann [15], and Wilks [22]. 

1. H. Cramér, “Mathematical Methods of Statistics,” Princeton University Press,
Princeton, N.J., 1946. 

2. J. L. Doob, Probability and Statistics, Trans. Amer. Math. Soc., vol. 36, p. 759, 
1934. 

3. J. L. Doob, Statistical Estimation, Trans. Amer. Math. Soc., vol. 39, p. 410, 1936. 

4. D. Dugué, Application des proprietés de la limite au sens du calcul des 
probabilités a l’étude des diverses questions d’estimation, J. Ecole Poly., vol. 3, p. 305, 
1937.




5. R. A. Fisher, On an Absolute Criterion for Fitting Frequency Curves, Messenger of
Math., vol. 41, p. 155, 1912. 

6. R. A. Fisher, On the Mathematical Foundations of Theoretical Statistics, 
Philos. Trans. Roy. Soc. London. Ser. A, vol. 222, p. 309, 1921. 

7. R. A. Fisher, Theory of Statistical Estimation, Proc. Cambridge Philos. Soc., 
vol. 22, p. 700, 1925. 

8. R. A. Fisher, Two New Properties of Mathematical Likelihood, Proc. Roy. Soc. 
London. Ser. A, vol. 144, p. 285, 1934. 

9. R. A. Fisher, The Logic of Inductive Inference, J. Roy. Statist. Soc., vol. 98, 
p. 39, 1935. 

10. R. A. Fisher, “Contributions to Mathematical Statistics,” John Wiley & Sons, 
Inc., New York, 1950. 

11. H. Hotelling, The Consistency and Ultimate Distribution of Optimum 
Statistics, Trans. Amer. Math. Soc., vol. 32, p. 847, 1930. 

12. O. Kempthorne, “Design and Analysis of Experiments,” John Wiley & Sons,
Inc., New York, 1952. 

13. S. Kolodziejczyk, On an Important Class of Statistical Hypotheses, Biometrika, 
vol. 27, p. 161, 1935. 

14. L. LeCam, On Some Asymptotic Properties of Maximum Likelihood Estimates 
and Related Bayes’ Estimates, Univ. California Publ. Statist., vol. 1, p. 277, 1953. 

15. H. B. Mann, “Analysis and Design of Experiments,” Dover Publications,
New York, 1949. 

16. J. Neyman and E. S. Pearson, On the Use and Interpretation of Certain Test 
Criteria for Purposes of Statistical Inference, Biometrika, vol. 20A, pp. 175, 263,
1928. 

17. J. Neyman and E. S. Pearson, On the Problem of the Most Efficient Tests of 
Statistical Hypotheses, Philos. Trans. Roy. Soc. London. Ser. A, vol. 231, p. 289, 
1933. 

18. J. Neyman and E. S. Pearson, On the Testing of Statistical Hypotheses in 
Relation to Probability a Priori, Proc. Cambridge Philos. Soc., vol. 29, p. 492, 1933. 

19. A. Wald, Tests of Statistical Hypotheses Concerning Several Parameters when 
the Number of Observations Is Large, Trans. Amer. Math. Soc., vol. 54, p. 426, 1943.

20. A. Wald, Note on the Consistency of the Maximum Likelihood Estimate, 
Ann. Math. Statist., vol. 20, p. 595, 1949. 

21. S.S. Wilks, The Large Sample Distribution of the Likelihood Ratio for Testing 
Composite Hypotheses, Ann. Math. Statist., vol. 9, p. 60, 1938. 

22. S. S. Wilks, “Mathematical Statistics,” Princeton University Press, Princeton,
N.J., 1943. 


DISTRIBUTION OF QUADRATIC FORMS 
AND THE VARIANCE-RATIO TESTS 


17.6 Distribution of Quadratic Forms 


Let $Y' = (y_1, y_2, \ldots, y_n)$ follow a normal distribution with $E(Y) = \mu$
and $\operatorname{var} Y = D(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2) = D(\sigma^2)$.
Definition 17.11. The quantity

$$\chi^2 = (Y - \mu)'D^{-1}(\sigma^2)(Y - \mu) = \sum_{i=1}^{n}\frac{(y_i - \mu_i)^2}{\sigma_i^2}$$

is said to follow the chi-square ($\chi^2$) distribution with $n$ degrees of freedom.
The probability density function of the $\chi^2$ distribution is

$$p(\chi^2) = \left[2^{n/2}\,\Gamma\!\left(\frac{n}{2}\right)\right]^{-1}(\chi^2)^{(n/2)-1}\exp\!\left(-\frac{\chi^2}{2}\right), \qquad 0 < \chi^2 < \infty,$$

and the cumulative function is tabulated in many statistical texts. If
$D(\sigma^2) = \sigma^2 I$, the quantity $(Y - \mu)'(Y - \mu)$ is said to follow a $\sigma^2\chi^2$
distribution with $n$ degrees of freedom.

Lemma 17.3. Let $A$ be an $n \times n$ symmetric matrix of rank $r \le n$, such that
$r$ of its characteristic roots are unity and the remaining $(n - r)$ roots zero. Then,
if $Y' = (y_1, y_2, \ldots, y_n)$ is normally distributed with $E(Y) = \mu$, $\operatorname{var} Y = \sigma^2 I$,
the quadratic form $(Y - \mu)'A(Y - \mu)$ follows a $\sigma^2\chi^2$ distribution with $r$ degrees
of freedom.

Proof. There will exist an orthogonal matrix $\Gamma$ such that

$$\Gamma'A\Gamma = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}.$$

Hence, if we introduce the transformation

$$Z = \Gamma'(Y - \mu),$$

$Z$ will follow a normal distribution with parameters

$$E(Z) = 0,$$
$$\operatorname{var} Z = E[ZZ'] = \Gamma'E[(Y - \mu)(Y - \mu)']\Gamma = \sigma^2 I.$$

The quadratic form can then be written

$$(Y - \mu)'A(Y - \mu) = Z'\Gamma'A\Gamma Z = \sum_{i=1}^{r} z_i^2.$$

Remark. A necessary and sufficient condition that the characteristic
roots of a symmetric matrix $A$ be either 0 or 1 is that $A$ be idempotent, that is,
$A^2 = A$.
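
As an illustration of the remark, the following minimal Python sketch builds a projection matrix $A = X'(XX')^{-1}X$ from a made-up $2 \times 4$ matrix $X$ (an assumption chosen only for illustration) and checks, in exact arithmetic, that $A^2 = A$ and that the trace of $A$ equals its rank. A symmetric idempotent matrix with trace 2 has two unit roots and two zero roots, so Lemma 17.3 would give the quadratic form two degrees of freedom. The helper names are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse_2x2(m):
    """Exact inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical 2 x 4 matrix of full row rank.
X = [[1, 1, 1, 1],
     [0, 1, 2, 3]]
Xt = transpose(X)
A = matmul(matmul(Xt, inverse_2x2(matmul(X, Xt))), X)   # A = X'(XX')^{-1}X

is_idempotent = matmul(A, A) == A
trace = sum(A[i][i] for i in range(len(A)))
print(is_idempotent, trace)
```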

Theorem 17.6. Let $Y' = (y_1, y_2, \ldots, y_n)$ be normally distributed with
$E(Y) = \mu$ and $\operatorname{var} Y = \sigma^2 I$. Consider

$$(Y - \mu)'(Y - \mu) = \sum_{i=1}^{k} q_i,$$

where $q_i = (Y - \mu)'A_i(Y - \mu)$ and the $A_i$ are real symmetric matrices.
Then any one of the following three properties,

(i) $\sum_i n_i = n$ ($n_i$ = rank of $A_i$),
(ii) $A_i^2 = A_i$ ($i = 1, 2, \ldots, k$),
(iii) $A_iA_j = 0$ for all $i \ne j$,

implies the other two, and these conditions are necessary and sufficient in order that
the quadratic forms $q_i$ follow independent $\sigma^2\chi^2$ distributions.

Proof. Necessity. If the $q_i$ follow $\sigma^2\chi^2$ distributions, then each $A_i$ has
nonzero roots equal to unity. Since

$$I = \sum_{i=1}^{k} A_i$$

and the $A_i$ are idempotent, by the previous remark,

$$I = \sum_i A_i^2 + \sum_{i \ne j} A_iA_j.$$

Taking traces we have

$$n = \sum_i \operatorname{tr} A_i + \sum_{i \ne j} \operatorname{tr} A_iA_j = \sum_i n_i + \sum_{i \ne j} \operatorname{tr} A_iA_j.$$

But $n = \operatorname{tr} I = \sum_i \operatorname{tr} A_i = \sum_i n_i$ and $\operatorname{tr} A_iA_j \ge 0$; hence $\operatorname{tr} A_iA_j = 0$, which implies $A_iA_j = 0$, as $A_i$ is
idempotent.


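Sufficiency. (i) Assume $\sum_i n_i = n$. There will exist an orthogonal
matrix $\Gamma$ such that

$$\Gamma'A_1\Gamma = D_1 = \begin{pmatrix} I_{n_1} & 0 \\ 0 & 0 \end{pmatrix}$$

(the nonzero roots of $A_1$ must be unity, since otherwise the rank of
$\sum_{i \ge 2} A_i$ would exceed $n - n_1$). Then

$$I = D_1 + \Gamma'\Bigl(\sum_{i=2}^{k} A_i\Bigr)\Gamma$$

and

$$\Gamma'\Bigl(\sum_{i=2}^{k} A_i\Bigr)\Gamma = \begin{pmatrix} 0 & 0 \\ 0 & I_{n - n_1} \end{pmatrix}.$$

Since $A_1 = \Gamma D_1\Gamma'$, we have $A_1^2 = A_1$, which shows that $A_1$ is idempotent.
It follows directly that the same transformation reduces each
$A_i$ to a diagonal matrix, which implies that all $A_i$ will be idempotent.
The proof that $A_iA_j = 0$ follows the same lines as in the necessity part.
(ii) Assume $A_i^2 = A_i$; then, since $I = \sum_i A_i$,

$$n = \sum_i \operatorname{tr} A_i = \sum_i n_i.$$

(iii) Assume $A_iA_j = 0$ for $i \ne j$. Taking powers of $I = \sum_i A_i$ results
in $I = \sum_i A_i^t$, and hence $n = \sum_i \operatorname{tr} A_i^t$, for every positive integer $t$. This is possible only if every
nonzero root of $A_i$ is unity. Thus the $A_i$ are idempotent, and it follows
that $n = \sum_i n_i$.

17.7 Variance-ratio Tests for the General Linear
Hypothesis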
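Definition 17.12. Let $q_1 = Y'A_1Y$ and $q_2 = Y'A_2Y$ be quadratic
forms in $Y$ which follow independent $\sigma^2\chi^2$ distributions with degrees of
freedom $\nu_1$ and $\nu_2$, respectively. Then the ratio

$$F = \frac{q_1/\nu_1}{q_2/\nu_2}$$

is said to follow a variance ratio or F distribution having the probability
density function

$$p(F) = \frac{\Gamma\!\left(\frac{\nu_1 + \nu_2}{2}\right)}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)}\left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} F^{(\nu_1/2)-1}\left(1 + \frac{\nu_1}{\nu_2}F\right)^{-(\nu_1+\nu_2)/2}, \qquad 0 < F < \infty.$$

The cumulative distribution of the variance ratio is tabulated in almost
all statistics texts.

Theorem 17.7. The quadratic forms of Theorem 17.5, that is, $q_1 =
(\hat\beta_0 - \tilde\beta_0)'X_0Y + \hat\beta_1'X_1Y$ and $q_2 = (Y - X'\hat\beta)'(Y - X'\hat\beta)$, follow independent
chi-square distributions with $r$ and $(n - p)$ degrees of freedom, respectively.
Hence $F = (q_1/r)/[q_2/(n - p)]$ follows a variance-ratio distribution and can be
used to test the null hypothesis $H_0\colon \beta_1 = 0$.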

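Proof. Define $C_0 = (X_0X_0')^{-1}$ and $C_{00}$, $C_{01}$, $C_{10}$, $C_{11}$ as in the corollary
to Theorem 17.5. Hence

$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} C_{00}X_0Y + C_{01}X_1Y \\ C_{10}X_0Y + C_{11}X_1Y \end{pmatrix}$$

and $q_1$ and $q_2$ can be written as

$$q_1 = Y'(\alpha_1 - \alpha_2)Y = Y'A_1Y,$$
$$q_2 = Y'(I - \alpha_1)Y = Y'A_2Y,$$

where

$$\alpha_1 = (X_0' \;\; X_1')\begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix}\begin{pmatrix} X_0 \\ X_1 \end{pmatrix}, \qquad \alpha_2 = X_0'(X_0X_0')^{-1}X_0.$$

Pre- and postmultiplying the identity

$$I = (\alpha_1 - \alpha_2) + (I - \alpha_1) + \alpha_2 = A_1 + A_2 + A_3$$

by $(Y - X'\beta)'$ and $(Y - X'\beta)$, we obtain

$$(Y - X'\beta)'(Y - X'\beta) = q_1^* + q_2^* + q_3^*,$$

where

$$q_i^* = (Y - X'\beta)'A_i(Y - X'\beta) \qquad \text{for } i = 1, 2, 3.$$

We now show that $A_1$, $A_2$, and $A_3$ are idempotent and that hence, by
Theorem 17.6, $q_1^*$, $q_2^*$, and $q_3^*$ follow independent $\sigma^2\chi^2$ distributions.
Multiplying $\alpha_1$ by $\alpha_2$ gives

$$\alpha_1\alpha_2 = X_0'C_{00}X_0X_0'C_0X_0 + X_0'C_{01}X_1X_0'C_0X_0 + X_1'C_{10}X_0X_0'C_0X_0 + X_1'C_{11}X_1X_0'C_0X_0$$
$$= X_0'(C_{00}X_0X_0' + C_{01}X_1X_0')C_0X_0 + X_1'(C_{10}X_0X_0' + C_{11}X_1X_0')C_0X_0$$
$$= X_0'C_0X_0 = \alpha_2.$$

We can similarly show that $\alpha_1^2 = \alpha_1$, $\alpha_2^2 = \alpha_2$; hence

$$A_1^2 = (\alpha_1 - \alpha_2)^2 = \alpha_1^2 - \alpha_1\alpha_2 - \alpha_2\alpha_1 + \alpha_2^2 = \alpha_1 - \alpha_2 = A_1,$$
$$A_2^2 = (I - \alpha_1)^2 = I - \alpha_1 = A_2,$$
$$A_3^2 = \alpha_2^2 = \alpha_2 = A_3,$$

and all the $A_i$ are idempotent. Therefore $q_1^*$, $q_2^*$, and $q_3^*$ follow independent
chi-square distributions.

It still remains to show that $q_1 = q_1^*$ and $q_2 = q_2^*$. If $H_0$ is true, $\beta_1 = 0$,
and we can write $q_1^*$ as

$$q_1^* = (Y - X_0'\beta_0)'A_1(Y - X_0'\beta_0)$$
$$= Y'A_1Y - \beta_0'X_0A_1Y - Y'A_1X_0'\beta_0 + \beta_0'X_0A_1X_0'\beta_0.$$

However, $X_0A_1 = 0$; that is,

$$X_0A_1 = (X_0X_0'C_{00} + X_0X_1'C_{10})X_0 + (X_0X_0'C_{01} + X_0X_1'C_{11})X_1 - X_0\alpha_2$$
$$= X_0 - X_0 = 0,$$

and thus $q_1^* = q_1$ only when $H_0$ is true.
Considering $q_2^* = (Y - X'\beta)'A_2(Y - X'\beta)$, we have

$$q_2^* = Y'A_2Y - \beta'XA_2Y - Y'A_2X'\beta + \beta'XA_2X'\beta.$$

We can similarly show that $XA_2 = 0$; therefore, $q_2^* = q_2$. The important
point to note is that $q_1^* = q_1$ only if the null hypothesis is true,
whereas $q_2^* = q_2$ regardless of the truth of the null hypothesis.

Thus, if $H_0$ is true, $F = (q_1/r)/[q_2/(n - p)]$ follows a variance-ratio
distribution.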
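
The matrix identities used in this proof can be verified on a small example. The sketch below (Python, exact rational arithmetic) assumes a made-up design with $n = 4$, $p = 2$, $r = 1$; the helper names are illustrative only. It forms $A_1 = \alpha_1 - \alpha_2$, $A_2 = I - \alpha_1$, $A_3 = \alpha_2$ and checks that they are idempotent, mutually orthogonal, and have traces (hence ranks) $r$, $n - p$, and $p - r$.

```python
from fractions import Fraction

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse(m):
    """Exact inverse of a small nonsingular matrix by Gauss-Jordan elimination."""
    n = len(m)
    w = [[Fraction(v) for v in m[i]] + [Fraction(int(i == j)) for j in range(n)]
         for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if w[i][c] != 0)
        w[c], w[p] = w[p], w[c]
        pivot = w[c][c]
        w[c] = [v / pivot for v in w[c]]
        for i in range(n):
            if i != c:
                f = w[i][c]
                w[i] = [vi - f * vc for vi, vc in zip(w[i], w[c])]
    return [row[n:] for row in w]

def projector(x):
    """Return X'(XX')^{-1}X for a matrix X stored as a list of rows."""
    xt = transpose(x)
    return matmul(matmul(xt, inverse(matmul(x, xt))), x)

X0 = [[1, 1, 1, 1]]        # hypothetical X0, (p - r) x n
X1 = [[0, 1, 2, 3]]        # hypothetical X1, r x n
n = 4
alpha1 = projector(X0 + X1)
alpha2 = projector(X0)
A1 = [[alpha1[i][j] - alpha2[i][j] for j in range(n)] for i in range(n)]
A2 = [[int(i == j) - alpha1[i][j] for j in range(n)] for i in range(n)]
A3 = alpha2

mats = [A1, A2, A3]
zero = [[0] * n for _ in range(n)]
all_idempotent = all(matmul(m, m) == m for m in mats)
all_orthogonal = all(matmul(mats[i], mats[j]) == zero
                     for i in range(3) for j in range(3) if i != j)
traces = [sum(m[i][i] for i in range(n)) for m in mats]
print(all_idempotent, all_orthogonal, traces)
```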


REFERENCES 


Theorem 17.6, in the form quoted above, was given by Lancaster [4]. It is based
on related work by James [5]. The condition $\sum_i n_i = n$ was first proved by Cochran
[2], in 1934, and is sometimes referred to as the Fisher-Cochran theorem or as
Cochran’s theorem. Other discussions can be found in Cramér ([1], pp. 116-116)
and Wilks ([6], pp. 105-108).
The distribution of $Z = \tfrac{1}{2}\ln F$ was first found by Fisher [3]. In practice it is
more convenient, for calculation purposes, to use F.

1. H. Cramér, “Mathematical Methods of Statistics,” Princeton University Press,
Princeton, N.J., 1946.

6. S. S. Wilks, “Mathematical Statistics,” Princeton University Press, Princeton,
N.J., 1943.


APPLICATIONS
In this part we illustrate the theory of the preceding sections. 


17.8 Linear Regression 


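A frequent problem is that of fitting a straight line to a set of measurements
or observations. Given $Y' = (y_1, y_2, \ldots, y_n)$, we desire to fit a
straight line, assuming that $E(y_\alpha) = \beta_0 + \beta_1(x_\alpha - \bar x)$ ($\alpha = 1, 2, \ldots, n$),
where $\bar x = (1/n)\sum_\alpha x_\alpha$ and $\operatorname{var} Y = \sigma^2 I$. Applying Theorem 17.1 (the
Gauss-Markoff theorem), we have

$$X' = \begin{pmatrix} 1 & x_1 - \bar x \\ 1 & x_2 - \bar x \\ \vdots & \vdots \\ 1 & x_n - \bar x \end{pmatrix}.$$

Therefore

$$a = XX' = \begin{pmatrix} n & 0 \\ 0 & \sum_\alpha (x_\alpha - \bar x)^2 \end{pmatrix}, \qquad XY = \begin{pmatrix} \sum_\alpha y_\alpha \\ \sum_\alpha (x_\alpha - \bar x)y_\alpha \end{pmatrix}.$$

The inverse of $a$ is

$$a^{-1} = \begin{pmatrix} \dfrac{1}{n} & 0 \\ 0 & \dfrac{1}{\sum_\alpha (x_\alpha - \bar x)^2} \end{pmatrix}.$$

Since $\hat\beta = a^{-1}XY$, we have

$$\hat\beta_0 = \bar y, \qquad \hat\beta_1 = \frac{\sum_\alpha (x_\alpha - \bar x)y_\alpha}{\sum_\alpha (x_\alpha - \bar x)^2}.$$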
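Using Eq. (17.15),

$$v'v = Y'Y - \hat\beta'XY = \sum_\alpha (y_\alpha - \bar y)^2 - \frac{\left[\sum_\alpha (x_\alpha - \bar x)y_\alpha\right]^2}{\sum_\alpha (x_\alpha - \bar x)^2},$$

and thus the estimate of $\sigma^2$ is

$$s^2 = \frac{v'v}{n - p} = \frac{v'v}{n - 2}.$$

The estimates for variances and covariances of $\hat\beta_0$ and $\hat\beta_1$ can be obtained
from $\operatorname{var}\hat\beta = a^{-1}\sigma^2$ by substituting $s^2$ for $\sigma^2$; for example,

$$\text{estimated } \operatorname{var}\hat\beta_0 = \frac{s^2}{n}, \qquad \text{estimated } \operatorname{var}\hat\beta_1 = \frac{s^2}{\sum_\alpha (x_\alpha - \bar x)^2}, \qquad \operatorname{cov}(\hat\beta_0, \hat\beta_1) = 0.$$

Note that the intercept of the linear equation is a linear function of $\beta_0$
and $\beta_1$, $\alpha = \beta_0 - \beta_1\bar x$. Hence the best estimate of the intercept is
$\hat\alpha = \hat\beta_0 - \hat\beta_1\bar x$, and its variance is

$$\operatorname{var}\hat\alpha = \operatorname{var}\hat\beta_0 + \bar x^2\operatorname{var}\hat\beta_1 - 2\bar x\operatorname{cov}(\hat\beta_0, \hat\beta_1).$$

A common statistical test is to test whether the slope is equal to a given
constant, that is, whether $\beta_1 = \beta_1^0$. This is an application of Theorem
17.5 with $p = 2$, $r = 1$, where

$$X_0 = (1, 1, \ldots, 1), \qquad X_1 = (x_1 - \bar x, x_2 - \bar x, \ldots, x_n - \bar x).$$

Remark 3 applies to this problem; therefore

$$q_1 = (\hat\beta_1 - \beta_1^0)'C_{11}^{-1}(\hat\beta_1 - \beta_1^0).$$

However, since $X_0X_1' = 0$, $C_{11}^{-1} = \sum_\alpha (x_\alpha - \bar x)^2$, and thus

$$q_1 = (\hat\beta_1 - \beta_1^0)^2\sum_\alpha (x_\alpha - \bar x)^2 = \frac{\left[\sum_\alpha (x_\alpha - \bar x)(y_\alpha - \beta_1^0 x_\alpha)\right]^2}{\sum_\alpha (x_\alpha - \bar x)^2}.$$

The variance-ratio test is then

$$F = \frac{q_1}{s^2} = \frac{(\hat\beta_1 - \beta_1^0)^2\sum_\alpha (x_\alpha - \bar x)^2}{s^2}.$$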
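
The formulas of this section translate directly into a short computation. The following is a minimal Python sketch using made-up data; the function name `fit_line` and its `slope0` argument (standing for $\beta_1^0$) are hypothetical names chosen for illustration.

```python
def fit_line(x, y, slope0=0.0):
    """Fit E(y) = b0 + b1 (x - xbar) and test H0: slope = slope0."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)               # sum (x - xbar)^2
    sxy = sum((xi - xbar) * yi for xi, yi in zip(x, y))   # sum (x - xbar) y
    b0 = ybar                                             # estimate of beta_0
    b1 = sxy / sxx                                        # estimate of beta_1
    rss = sum((yi - ybar) ** 2 for yi in y) - sxy ** 2 / sxx
    s2 = rss / (n - 2)                                    # estimate of sigma^2
    var_b0 = s2 / n
    var_b1 = s2 / sxx
    intercept = b0 - b1 * xbar
    var_intercept = var_b0 + xbar ** 2 * var_b1           # the covariance term is zero
    f_ratio = (b1 - slope0) ** 2 * sxx / s2               # 1 and n - 2 degrees of freedom
    return {"b0": b0, "b1": b1, "s2": s2, "intercept": intercept,
            "var_intercept": var_intercept, "F": f_ratio}

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
res = fit_line(x, y)
print(res)
```

The returned `F` would be compared with the tabulated variance-ratio distribution having 1 and $n - 2$ degrees of freedom.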
17.9 Multiple Regression


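A generalization of the preceding example is to fit a function of the
form

$$E(y_\alpha) = \beta_0 + \sum_{i=1}^{k}\beta_i x_{i\alpha}.$$

Note that this is a fairly general function, for if $x_{i\alpha} = t_\alpha^i$, we have a polynomial,
or $x_{i\alpha}$ can refer to trigonometric terms, etc. No loss in generality
results if we allow $\sum_{\alpha=1}^{n} x_{i\alpha} = 0$ for all $i$. Setting

$$X_0 = (1, 1, \ldots, 1), \qquad X = (x_{i\alpha}) \quad (i = 1, 2, \ldots, k;\ \alpha = 1, 2, \ldots, n),$$
$$\beta' = (\beta_1, \beta_2, \ldots, \beta_k),$$

we can write

$$E(Y) = X_0'\beta_0 + X'\beta.$$

Since $X_0X' = 0$, the normal equations can be written

$$\begin{pmatrix} n & 0 \\ 0 & XX' \end{pmatrix}\begin{pmatrix} \hat\beta_0 \\ \hat\beta \end{pmatrix} = \begin{pmatrix} \sum_\alpha y_\alpha \\ XY \end{pmatrix}$$

and hence

$$\hat\beta_0 = \bar y, \qquad \hat\beta = (XX')^{-1}XY.$$

Often we should like to test the null hypothesis that $\beta_i = \beta_i^0$ for $i = 1,
2, \ldots, k$. Using Theorem 17.5 we have the variance ratio

$$F = \frac{(1/k)(\hat\beta - \beta^0)'(XX')(\hat\beta - \beta^0)}{s^2},$$

where

$$s^2 = \frac{Y'Y - n\bar y^2 - \hat\beta'XY}{n - k - 1}.$$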
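
A small worked instance may help fix the notation. The sketch below (Python, exact rational arithmetic) assumes two made-up regressors that each sum to zero, namely a centred linear term and a centred quadratic term, and tests $\beta = \beta^0 = 0$; the names `solve`, `dot`, and `regression_f` are illustrative, not from the text.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination, exactly, using Fractions."""
    n = len(a)
    m = [[Fraction(v) for v in a[i]] + [Fraction(b[i])] for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if m[i][c] != 0)
        m[c], m[p] = m[p], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for i in range(n):
            if i != c:
                f = m[i][c]
                m[i] = [vi - f * vc for vi, vc in zip(m[i], m[c])]
    return [row[n] for row in m]

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

def regression_f(x, y, beta0):
    """x holds k rows of regressors, each summing to zero; beta0 is the hypothesised vector."""
    n, k = len(y), len(x)
    ybar = Fraction(sum(y), n)                       # estimate of beta_0
    xx = [[dot(u, v) for v in x] for u in x]         # X X'
    xy = [dot(u, y) for u in x]                      # X Y
    beta = solve(xx, xy)                             # (X X')^{-1} X Y
    s2 = (dot(y, y) - n * ybar ** 2 - dot(beta, xy)) / (n - k - 1)
    d = [b - b0 for b, b0 in zip(beta, beta0)]
    quad = dot(d, [dot(row, d) for row in xx])       # (beta - beta0)' X X' (beta - beta0)
    return ybar, beta, s2, quad / k / s2

# Hypothetical data: t = -2..2, regressors t and t^2 - 2 (both sum to zero).
x = [[-2, -1, 0, 1, 2],
     [2, -1, -2, -1, 2]]
y = [1, 3, 2, 5, 4]
ybar, beta, s2, F = regression_f(x, y, [0, 0])
print(ybar, beta, s2, F)
```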


17.10 Testing k-population Means 


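Let $E(y_{ij}) = \beta_i$ for $j = 1, 2, \ldots, n$; $i = 1, 2, \ldots, k$. Let $Y_i' = (y_{i1},
y_{i2}, \ldots, y_{in})$, $Y' = (Y_1', Y_2', \ldots, Y_k')$, $\beta' = (\beta_1, \beta_2, \ldots, \beta_k)$; then $E(Y)
= X'\beta$, where

$$X' = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$

and 0 and 1 are vectors of $n$ elements consisting entirely of 0s or 1s, respectively.
Consider the null hypothesis that all $\beta_i$ are equal, that is,

$$H_0\colon\ (\beta_1 = \beta_2 = \cdots = \beta_k).$$

The normal equations over $\Omega$ are

$$nI\hat\beta = \begin{pmatrix} 1'Y_1 \\ 1'Y_2 \\ \vdots \\ 1'Y_k \end{pmatrix}$$

or

$$\hat\beta_i = \frac{1}{n}1'Y_i = \bar y_i.$$

Then

$$\hat\sigma^2 = \frac{1}{kn}(Y'Y - \hat\beta'XY) = \frac{1}{kn}\sum_i\sum_j (y_{ij} - \bar y_i)^2$$

with $k(n - 1)$ degrees of freedom. Similarly, if $H_0$ is true, $E(Y) = X'1\beta$.
The normal equations are

$$nk\tilde\beta = 1'Y_1 + 1'Y_2 + \cdots + 1'Y_k$$

or

$$\tilde\beta = \frac{1}{kn}\sum_i\sum_j y_{ij} = \bar y;$$

$$\tilde\sigma^2 = \frac{1}{kn}\sum_i\sum_j (y_{ij} - \bar y)^2$$

with $(kn - 1)$ degrees of freedom. Using (17.26), the variance-ratio
test is

$$F = \frac{(\tilde\sigma^2 - \hat\sigma^2)/(k - 1)}{\hat\sigma^2/[k(n - 1)]} = \frac{n\sum_i (\bar y_i - \bar y)^2/(k - 1)}{\sum_i\sum_j (y_{ij} - \bar y_i)^2/[k(n - 1)]}.$$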
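
The final expression is the familiar one-way analysis of variance ratio. A minimal Python sketch, assuming $k$ groups of equal size $n$ and made-up observations (the function name `one_way_f` is hypothetical):

```python
def one_way_f(groups):
    """Variance ratio for H0: all k population means are equal (equal group sizes n)."""
    k, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]                 # group means
    grand = sum(means) / k                               # overall mean
    within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    between = n * sum((m - grand) ** 2 for m in means)
    return (between / (k - 1)) / (within / (k * (n - 1)))

# Hypothetical data: k = 3 populations, n = 3 observations each.
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
F = one_way_f(groups)
print(F)
```

The result would be referred to the variance-ratio distribution with $k - 1$ and $k(n - 1)$ degrees of freedom.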


17.11 Block Designs 


A common type of experiment is that in which one is investigating the
effects of several treatments on a set of experimental units. The term
treatment is a generic term which refers to different methods, processes,
materials, etc. If the measurements are taken such that they come in
relatively homogeneous groups, the group of measurements is termed a
block. Let there be $v$ treatments and $b$ blocks such that the $i$th treatment
is measured $r_i$ times and there are $k$ experimental units in each block.
Let $y_{ij}$ refer to the measurement of the $i$th treatment in the $j$th block,
where $y_{ij}$ is defined only if treatment $i$ is measured on one of the experimental
units of the $j$th block. Further assume that no treatment is
measured more than once in the same block. We assume that the
mathematical model underlying this experimental situation is

$$E(y_{ij}) = t_i + b_j, \qquad i = 1, 2, \ldots, v;\ j = 1, 2, \ldots, b,$$
$$\operatorname{var}(y_{ij}) = \sigma^2, \qquad \operatorname{cov}(y_{ij}, y_{st}) = 0, \quad i \ne s,\ j \ne t,$$

where $t_i$ is the effect of the $i$th treatment and $b_j$ is the effect of the $j$th
block.
Define

$$n_{ij} = \begin{cases} 1 & \text{if treatment } i \text{ is measured in block } j, \\ 0 & \text{otherwise.} \end{cases}$$

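The $v \times b$ matrix $n = (n_{ij})$ is called the incidence matrix and summarizes
the experimental configuration. Thus if 1 is a column vector of appropriate
dimensions having only unit elements, we have

$$n1 = r, \qquad r' = (r_1, r_2, \ldots, r_v),$$
$$n'1 = k1.$$

Define

$$m_{is}^{(j)} = \begin{cases} 1 & \text{if treatment } i \text{ is measured in the } s\text{th experimental unit of block } j, \\ 0 & \text{otherwise,} \end{cases}$$

and let $m_j = (m_{is}^{(j)})$ ($i = 1, 2, \ldots, v$; $s = 1, 2, \ldots, k$) be a set of
$v \times k$ matrices. Also, let $y_j$ denote the $k \times 1$ vector of the measurements
in the $j$th block. Then the experimental setup can be written as

$$E\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_b \end{pmatrix} = \begin{pmatrix} m_1' & 1 & 0 & \cdots & 0 \\ m_2' & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ m_b' & 0 & 0 & \cdots & 1 \end{pmatrix}\begin{pmatrix} t \\ b \end{pmatrix},$$

where $t' = (t_1, t_2, \ldots, t_v)$ and $b' = (b_1, b_2, \ldots, b_b)$. The normal equations
can then be written as

$$\begin{pmatrix} \sum_j m_jm_j' & m_11 & \cdots & m_b1 \\ 1'm_1' & & & \\ \vdots & & kI & \\ 1'm_b' & & & \end{pmatrix}\begin{pmatrix} \hat t \\ \hat b \end{pmatrix} = \begin{pmatrix} \sum_j m_jy_j \\ 1'y_1 \\ \vdots \\ 1'y_b \end{pmatrix}.$$

Note that the term $\sum_j m_jm_j'$ can be written as the diagonal matrix

$$D(r) = \begin{pmatrix} r_1 & & 0 \\ & \ddots & \\ 0 & & r_v \end{pmatrix}.$$

Also $m_j1$ is simply the $j$th column of the incidence matrix $n$. The right-hand
side can also be simplified by observing that the $i$th element of $m_jy_j$ is the
measurement of the $i$th treatment in the $j$th block, and thus the $i$th element of
$\sum_j m_jy_j$ is the sum of all measurements having the $i$th treatment. We
denote the $i$th treatment total by $T_i$ and the $j$th block total by $B_j$ and set
$T' = (T_1, T_2, \ldots, T_v)$, $B' = (B_1, B_2, \ldots, B_b)$. Then the normal
equations can be written as

$$\begin{pmatrix} D(r) & n \\ n' & kI \end{pmatrix}\begin{pmatrix} \hat t \\ \hat b \end{pmatrix} = \begin{pmatrix} T \\ B \end{pmatrix}.$$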

The matrix of coefficients will always be singular, and it is possible to
show that it will have rank $b + v - 1$. To make the solutions unique,
we arbitrarily let $t$ satisfy $1't = 0$. (The parameter $1't$ is nonestimable.)
Therefore, the $K'$ matrix of Theorem 17.3 is the $1 \times (b + v)$ matrix

$$K' = (1' \;\; 0').$$

Using Eq. (17.20) with $D = D(r)$, $J = 11'$, and

$$KK' = \begin{pmatrix} J & 0 \\ 0 & 0 \end{pmatrix},$$

we can write

$$a_K = \begin{pmatrix} D + J & n \\ n' & kI \end{pmatrix}.$$

Hence the inverse of $a_K$ is

$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},$$

where

$$C_{11} = \left(D(r) - \frac{nn'}{k} + J\right)^{-1},$$
$$C_{12} = -\frac{C_{11}n}{k}, \qquad C_{21} = C_{12}',$$
$$C_{22} = \frac{1}{k}\left(I + \frac{n'C_{11}n}{k}\right).$$

Thus the solutions for $t$ and $b$ are

$$\hat t = C_{11}\left(T - \frac{nB}{k}\right) = C_{11}Q,$$
$$\hat b = \frac{1}{k}(B - n'\hat t).$$

Formula (iii) of the corollary to Theorem 17.3 enables one to calculate
the estimate of $\sigma^2$. Thus

$$s^2 = \frac{1}{bk - b - v + 1}\Bigl(\sum y_{ij}^2 - \hat t'T - \hat b'B\Bigr) = \frac{1}{bk - b - v + 1}\Bigl(\sum y_{ij}^2 - \hat t'Q - \frac{B'B}{k}\Bigr)$$

with $(bk - b - v + 1)$ degrees of freedom.

Often we are interested in testing the null hypothesis that all treatment
effects are zero, that is, $H_0\colon (t_1 = t_2 = \cdots = t_v = 0)$ regardless of the
values of the $b_j$. This is an application of Theorem 17.5. Using the
corollary to this theorem, the appropriate variance-ratio test is easily
seen to be

$$F = \frac{\hat t'C_{11}^{-1}\hat t/(v - 1)}{s^2}$$

with degrees of freedom equal to $(v - 1)$ and $(bk - b - v + 1)$. Since
$\hat t = C_{11}Q$, this variance ratio is usually written in the form

$$F = \frac{\hat t'Q/(v - 1)}{s^2}.$$
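
The block-design computations can be traced on a very small design. The following Python sketch is illustrative only: it assumes the form $C_{11} = (D(r) - nn'/k + J)^{-1}$ with $Q = T - nB/k$, uses a made-up balanced design with $v = 3$ treatments in $b = 3$ blocks of $k = 2$ units, and works in exact rational arithmetic. The names `inverse` and `block_analysis`, and the dictionary layout of the observations, are hypothetical.

```python
from fractions import Fraction

def inverse(m):
    """Exact inverse of a small nonsingular matrix by Gauss-Jordan elimination."""
    n = len(m)
    w = [[Fraction(v) for v in m[i]] + [Fraction(int(i == j)) for j in range(n)]
         for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if w[i][c] != 0)
        w[c], w[p] = w[p], w[c]
        pivot = w[c][c]
        w[c] = [v / pivot for v in w[c]]
        for i in range(n):
            if i != c:
                f = w[i][c]
                w[i] = [vi - f * vc for vi, vc in zip(w[i], w[c])]
    return [row[n:] for row in w]

def block_analysis(obs, v, b, k):
    """obs maps (treatment i, block j) to an integer measurement; each block has k units."""
    n = [[int((i, j) in obs) for j in range(b)] for i in range(v)]      # incidence matrix
    r = [sum(row) for row in n]                                         # replications r_i
    T = [sum(y for (i, _), y in obs.items() if i == t) for t in range(v)]
    B = [sum(y for (_, j), y in obs.items() if j == blk) for blk in range(b)]
    Q = [T[i] - Fraction(sum(n[i][j] * B[j] for j in range(b)), k) for i in range(v)]
    # Assumed form: C11 = (D(r) - n n'/k + J)^{-1}
    m = [[(r[i] if i == s else 0)
          - Fraction(sum(n[i][j] * n[s][j] for j in range(b)), k) + 1
          for s in range(v)] for i in range(v)]
    c11 = inverse(m)
    t_hat = [sum(c11[i][s] * Q[s] for s in range(v)) for i in range(v)]
    tq = sum(t * q for t, q in zip(t_hat, Q))                           # t_hat' Q
    df = b * k - b - v + 1
    s2 = (sum(y * y for y in obs.values()) - tq - Fraction(sum(x * x for x in B), k)) / df
    return t_hat, s2, tq / (v - 1) / s2

# Hypothetical measurements: blocks {0,1}, {0,2}, {1,2} (treatments numbered from 0).
obs = {(0, 0): 5, (1, 0): 7, (0, 1): 4, (2, 1): 9, (1, 2): 6, (2, 2): 8}
t_hat, s2, F = block_analysis(obs, v=3, b=3, k=2)
print(t_hat, s2, F)
```

Note that the estimated treatment effects sum to zero, in agreement with the side condition $1't = 0$.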
REFERENCES 


Wilks ([3], Chap. IX) is a good reference for illustrating many applications of
normal regression theory. The procedures used in testing linear hypotheses, as in
Secs. 17.9 to 17.11, are usually known as the analysis of variance. This type of
procedure was first introduced by R. A. Fisher and has been generalized by many
workers. The reader may refer to Kempthorne [1] for a general exposition of block
designs or to the paper by Tocher [2] for a concise, but complete, treatment. The
[...] which is devoted primarily to the analysis and construction of block designs.

17.12 Some Remarks on Computations with High-speed
Computers


Applying the theory of least squares to large sets of data or to models
with a large number of parameters has, until the advent of high-speed
electronic computers, been restricted because of the large amount of
calculation necessary. However, with the increased use of high-speed
computers, these calculations are now beginning to appear routine.

The essential numerical operations in applying the theory of least
squares are: (1) a matrix multiplication to obtain the normal equations,
(2) the solution of a set of simultaneous linear equations and the
inversion of the matrix of normal equations, and (3) some auxiliary
computations important to the user, such as the calculation of each
residual. Experience with computers has shown that, with respect to
general applications, it is more efficient to program the least-squares
computations using an orthonormalizing process. This will usually
result in more accurate calculations and reduce the round-off error.
For this purpose the Gram-Schmidt process is particularly attractive as
it can be programmed in a recurrence form. The paper by Davis and
Rabinowitz [1] outlines these procedures with reference to electronic
computers. See also Chap. 10.
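
To make the orthonormalizing approach concrete, the following is a minimal Python sketch of least squares by the modified Gram-Schmidt recurrence, applied to made-up data. It is not the procedure of Davis and Rabinowitz [1]; it is only one simple way such a recurrence might be arranged, and the function name `gram_schmidt_lstsq` is hypothetical. Each regressor (a row of $X$) is orthonormalized against the earlier ones, and the coefficients are then recovered by back substitution, so the normal equations are never formed.

```python
import math

def gram_schmidt_lstsq(rows_of_x, y):
    """Least squares via modified Gram-Schmidt on the regressor vectors (rows of X)."""
    p = len(rows_of_x)
    q = []                                   # orthonormal vectors
    r = [[0.0] * p for _ in range(p)]        # upper-triangular coefficients
    for j, row in enumerate(rows_of_x):
        v = [float(c) for c in row]
        for i in range(j):                   # remove the components along earlier q_i
            r[i][j] = sum(a * b for a, b in zip(q[i], v))
            v = [a - r[i][j] * b for a, b in zip(v, q[i])]
        r[j][j] = math.sqrt(sum(a * a for a in v))
        q.append([a / r[j][j] for a in v])
    c = [sum(a * b for a, b in zip(qi, y)) for qi in q]      # projections of Y
    beta = [0.0] * p
    for j in reversed(range(p)):             # back substitution: R beta = c
        beta[j] = (c[j] - sum(r[j][l] * beta[l] for l in range(j + 1, p))) / r[j][j]
    residuals = [yi - sum(b * row[a] for b, row in zip(beta, rows_of_x))
                 for a, yi in enumerate(y)]
    return beta, residuals

# Hypothetical data: a constant term and one regressor.
rows_of_x = [[1, 1, 1, 1, 1],
             [1, 2, 3, 4, 5]]
y = [1, 3, 2, 5, 4]
beta, residuals = gram_schmidt_lstsq(rows_of_x, y)
rss = sum(e * e for e in residuals)
print(beta, rss)
```

The residuals come out as a by-product, which corresponds to the auxiliary computations mentioned in item (3) above.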

With more specialized applications such as the analysis of variance, it
may be more convenient to use a special analysis of variance program,
rather than the more general orthonormalization routine. Calculating
analysis of variance problems on high-speed computers may be
particularly appropriate when there is a large amount of data from a
single source or when there are many sets of data to be analyzed in the
same manner. The papers by Hartley [2] and Yates, Healy, and
Lipton [3] discuss general methods for programming analysis of
variance computations on a computer.